Resource reservation for storage system metadata updates

ABSTRACT

Storage systems track free blocks using various data structures and maps. For instance, free block maps may contain data blocks with values that indicate whether a block is free or not. When an operation results in a block being freed, the relevant data block in the maps must be written during an I/O operation to update the value. Large numbers of updates may occur after an operation that frees a large number of blocks, which can lead to performance degradation. Accordingly, disclosed are systems and methods for deferring updating of free block data tracking structures using logs.

BACKGROUND

Aspects of the disclosures herein generally relate to the field of storage systems, and, more particularly, to efficiently updating storage system metadata.

Storage systems commonly maintain metadata to facilitate their operation. For example, storage systems can maintain metadata indicating which data blocks are available to be allocated, which data blocks belong to particular storage objects, etc. While some of the metadata remains relatively static, other metadata is subject to frequent modification. Modifications to the metadata can result in storage system overhead, thus decreasing the efficiency of the storage system itself. Decreased storage system efficiency can result in a poor user experience, higher costs, etc. Increasing the efficiency of the modifications to the metadata can decrease the storage system overhead, thus increasing the performance of the storage system.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosures herein may be better understood, and features made apparent to those skilled in the art, by referencing the accompanying drawings.

FIG. 1 is a conceptual diagram depicting a storage system including a block free unit with efficient metadata updates.

FIG. 2 is a conceptual diagram depicting the use of a read-ahead mechanism facilitated by a sorted list of block identifiers.

FIG. 3 is a conceptual diagram depicting the performance of a log switch between two different block logs.

FIG. 4 is a flowchart depicting example operations for appending block identifiers to an active log, performing a log switch, and sorting subsets of the active log.

FIG. 5 is a flowchart depicting example operations for updating metadata associated with freed blocks indicated in a sorted log.

FIG. 6 is a flowchart depicting example operations for generating a sorted log by merging multiple sorted subsets of block identifiers.

FIG. 7 is a conceptual diagram illustrating the increased spatial locality facilitated by a sorted log.

FIG. 8 depicts a block free unit with a resource reservation-based workload management unit.

FIG. 9 is a flowchart depicting example operations for reserving resources to free blocks.

FIG. 10 depicts an example computer system with a block free unit.

DETAILED DESCRIPTION OF EXAMPLE ILLUSTRATIONS

The description that follows includes example systems, methods, techniques, instruction sequences and computer program products that embody techniques of the disclosures herein. However, it is understood that the described examples may be practiced without these specific details. For instance, although examples refer to using an active map to track whether blocks are free or available to be allocated, other tracking structures can be utilized. In other instances, well-known instruction instances, protocols, structures and techniques have not been shown in detail in order not to obfuscate the description.

Storage devices, such as hard drives and solid state storage drives, are typically formatted into data blocks (hereinafter “blocks”). A block is, typically, the smallest unit of storage that can be read or written to. In other words, if a block is four kilobytes in size, the entire four kilobytes of the block is read or written, even if only one byte of data is actually changed. Blocks that are representative of a block existing on a storage device are referred to herein as physical blocks.

File systems are typically formatted in a similar manner: files within the file system are collections of blocks, which are referred to herein as logical blocks. In some instances, logical blocks and physical blocks have a one-to-one correspondence. In other words, if the physical blocks are four kilobytes in size, the logical blocks are four kilobytes in size, and the boundaries of the blocks correspond to physical blocks. In some instances, however, the logical blocks do not directly correspond to physical blocks. For example, logical blocks might be eight kilobytes while physical blocks are four kilobytes in size. In such a scenario, each logical block corresponds to two physical blocks.

Regardless of whether the subject is physical blocks or logical blocks, a mechanism is typically used to track whether a block is being used to store data (“allocated”) or is available to be allocated (“free”). Consider, for example, the creation of a file. If the default size of the file is four kilobytes and the logical block size is four kilobytes, the file system allocates a single logical block to the file. The data associated with the file can then be written to the logical block. Later, when the file is deleted, instead of actually deleting the file data from the logical block, the file system changes a value to indicate that the logical block is free (i.e., unallocated). Thus, no data is actually deleted, but the logical block is still available to be allocated by the file system.

Various mechanisms, such as free lists and bitmaps, exist to track free blocks. A free list typically includes an indication, such as a block number, of blocks that are free. A bitmap, on the other hand, includes a set of bits wherein each bit is associated with a particular block. For example, the first bit of the bitmap might be associated with block one, the second bit of the bitmap might be associated with block two, and the nth bit of the bitmap might be associated with block n. Each bit can then be set to a particular value (0 or 1) to represent whether the block is free. The descriptions herein will assume the value 0 in a bitmap indicates a free block. The various data structures used to track free blocks, such as a bitmap or list, are referred to collectively as tracking structures. A bitmap tracking structure used to track which blocks are free is referred to as an “active map” (corresponding to a ‘1’ signifying an allocated, or active, block).
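The bitmap convention described above can be sketched as follows. This is a minimal in-memory illustration only; the class name `ActiveMap`, its method names, and the zero-based block numbering are assumptions made for the example and do not appear in the disclosure.

```python
class ActiveMap:
    """Hypothetical in-memory active map: bit 0 means free, bit 1 means allocated."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # One bit per block, packed eight to a byte; all blocks start free (0).
        self.bits = bytearray((num_blocks + 7) // 8)

    def _locate(self, block):
        # Assumption: blocks are numbered from zero in this sketch.
        if not 0 <= block < self.num_blocks:
            raise IndexError(block)
        return block // 8, 1 << (block % 8)

    def allocate(self, block):
        byte, mask = self._locate(block)
        self.bits[byte] |= mask

    def free(self, block):
        byte, mask = self._locate(block)
        self.bits[byte] &= ~mask & 0xFF

    def is_free(self, block):
        byte, mask = self._locate(block)
        return not (self.bits[byte] & mask)
```

In a storage system the map would be persisted in blocks on a storage device rather than held in a single byte array, which is why a one-bit change can involve reading and writing an entire block.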

As alluded to above, when certain operations are performed, blocks can be transitioned to a free state (“be freed”). In some instances, the operations performed to free a block can result in performance degradation. Consider a scenario in which an active map is used to track free blocks. The active map is a data structure that is represented in memory, on a storage device, etc. When a particular block is freed, the bit in the active map corresponding to that particular block is changed from a 1 to a 0 (corresponding to the transition between “used” and “free”). However, to change the individual bit, the entire block of data containing that bit is read and written to. If a large number of blocks are being freed, a block of data might be read and written to for each of the freed blocks. Thus, the updating of the active map can result in a large number of input/output operations.

The above problems are not limited to bitmaps, but rather apply to most mechanisms that track free blocks. For example, to free a block using a free list, a block identifier is inserted or appended to the list. To then mark the block as in use, the corresponding block identifier is removed from the free list. Removing the block identifier from the free list is typically done by searching the free list for the particular identifier and deleting it. Thus, even though the updates occur when allocating blocks instead of freeing them, updates may still occur at random locations within the free list, resulting in a large number of input/output operations.

While the tracking structures described above focus on tracking whether blocks are free or allocated, other tracking structures might exist. For example, in storage systems that support deduplication, multiple storage objects (such as files) might reference a single block. A block typically cannot be freed unless there are no additional references to that block. A tracking structure similar to a bitmap can be used to track the number of references to a single block. Instead of a block number corresponding to a single bit, a reference count tracking structure might map block numbers to groups of bits large enough to store the maximum number of references that can refer to a single block.

Further, tracking structures are just particular examples of metadata that may be updated during the operation of a storage system. In general, metadata that is changed subject to operations that might result in random updates to data can cause performance degradation as described above. Thus, the disclosures herein are applicable to other types of metadata as well. The illustrations herein will use the updating of block-related tracking structures as examples, but the disclosures can be adapted for other scenarios as well.

The performance penalty associated with random updates described above can be mitigated by not updating a tracking structure as soon as it is determined that the block is no longer needed. Instead, the blocks that are to be freed can be tracked in deferred-free block logs (hereinafter “block logs”). When an operation results in a freed block, a block identifier, such as the block number, is appended to an active deferred-free block log (hereinafter “active log”). As the active log fills up with block identifiers, subsets of the block identifiers are sorted with respect to other block identifiers in the respective subsets. Generally, the subsets of the block identifiers are contiguous entries in the active log. For example, the first ten block identifiers in the active log might be the first subset, the second ten block identifiers in the active log might be the second subset, etc.

Once the number of block identifiers in the active log (or the size of the active log itself) reaches a particular threshold, a log switch occurs in which the active log becomes an inactive log and a previous, empty inactive log becomes the new, empty active log. After the log switch, no more block identifiers are appended to the inactive log (previously the active log); they are instead appended to the new active log.
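The append-and-switch behavior can be sketched as below, assuming for illustration that the threshold is a count of block identifiers and that both logs fit in memory. The class `BlockLogs`, the method `append_freed`, and the choice to defer the switch while the inactive log is still being drained are hypothetical details, not taken from the disclosure.

```python
class BlockLogs:
    """Hypothetical pair of logs that alternate between active and inactive roles."""

    def __init__(self, switch_threshold):
        self.switch_threshold = switch_threshold  # assumed: count of identifiers
        self.active = []    # receives newly freed block identifiers
        self.inactive = []  # waits to be merged into a sorted log

    def append_freed(self, block_id):
        """Append a block identifier; return True if a log switch occurred."""
        self.active.append(block_id)
        if len(self.active) < self.switch_threshold:
            return False
        if self.inactive:
            # Assumption: the switch is deferred until the inactive log is empty.
            return False
        # Log switch: the full active log becomes inactive and the previously
        # empty inactive log becomes the new active log.
        self.active, self.inactive = self.inactive, self.active
        return True
```

Swapping the two roles, rather than copying entries, keeps the switch itself inexpensive regardless of how many identifiers the active log holds.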

Once an active log becomes the inactive log, the various subsets of block identifiers (which are internally sorted) are merged with each other into a single sorted list of block identifiers. This single sorted list is stored in a sorted log and the inactive log is truncated or otherwise emptied. Once the sorted log is established, the sorted log is iterated through and the block identifiers to be freed are used to update the tracking structures. By sorting the block identifiers, the spatial locality of these updates to the tracking structures is increased, which can reduce the number of input/output operations performed to update the tracking structures.
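The effect of sorting on input/output can be illustrated with a rough model. The sketch below assumes a four-kilobyte map block that covers 32,768 bits and a single cached map block, so that a new read-and-write cycle is counted whenever consecutive updates fall in different map blocks. The function name, the cache model, and the block identifiers are illustrative assumptions.

```python
def count_map_block_loads(block_ids, bits_per_map_block=4096 * 8):
    """Count how often a different active map block must be loaded.

    Assumes only one map block is held in memory at a time.
    """
    loads = 0
    cached = None
    for block_id in block_ids:
        map_block = block_id // bits_per_map_block
        if map_block != cached:
            loads += 1
            cached = map_block
    return loads


# Illustrative identifiers in the order the blocks happened to be freed.
freed = [42, 70000, 350, 40000, 779, 70001]
unsorted_loads = count_map_block_loads(freed)
sorted_loads = count_map_block_loads(sorted(freed))
```

Under these assumptions the unsorted order loads a map block for every update, while the sorted order loads each of the three affected map blocks once.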

FIG. 1 is a conceptual diagram depicting a storage system including a block free unit with efficient metadata updates. FIG. 1 depicts a storage system including a storage controller 106 and a storage device 124. The storage controller 106 includes a block free unit with efficient metadata updates (hereinafter “block free unit”) 108 and the block free unit 108 includes an insertion unit 110, a sort unit 112, a sorted merge unit 114, and a free unit 116. FIG. 1 also includes a client 102.

The block free unit 108 (or the components thereof) operates on a set of block logs, including an active log 118, an inactive log 120, and a sorted log 122. In the examples described herein, the active log 118 becomes the inactive log 120 after a “log switch”, while the sorted log 122 is independent of the active log 118 and inactive log 120, as described in more detail below. The block logs, or portions thereof, can be stored in memory (not depicted) located on the storage controller 106 or on one or more storage devices, such as the storage device 124. The active log 118 and inactive log 120 are divided into fixed size subsets of block identifiers. The subsets are identified by the brackets as well as bolded outlines.

At stage A, the client 102 issues a command 104 to the storage controller 106. In this instance, the command 104 is a block-level command that specifically indicates that block ‘833’ should be freed. The particular commands that can result in a block being freed can vary between protocols, storage system configurations, etc. For example, a command might indicate that data should be “deleted” instead of explicitly stating that a block should be freed. As another example, a command can be a file-level command instead of a block-level command. A file-level command might specify that a particular file should be deleted. The storage controller 106 can convert file-level commands to block-level commands by determining which blocks contain data for the particular file referenced. When the storage controller 106 receives the command 104, the storage controller 106 determines that the command 104 results in a block being freed and notifies the insertion unit 110 of the particular block identifier (‘833’, in this case).

At stage B, the insertion unit 110 appends the block identifier (received at stage A) to the active log 118. The insertion unit 110 can append the block identifier to the active log 118 by writing the block identifier to a memory or other storage location associated with the active log 118. For example, the insertion unit 110 can maintain a file pointer that indicates the location on a storage device that is immediately after the location in which a previous block identifier was inserted into the active log 118.

The active log 118 can be divided into subsets. In this particular example, the active log 118 is divided into subsets #0, #1, and #2. The maximum size of the subsets is fixed at four block identifiers. In practice, the subsets are typically much larger (e.g., 1.25×2²⁰ block identifiers) and the subset size might be variable. In this particular example, the insertion unit 110 appends block identifier 833 to the active log 118, resulting in subset #2 having three block identifiers.

The insertion unit 110 also tracks, or otherwise determines, the size of the active log 118 and the status of the individual subsets. In particular, the insertion unit 110 determines when the size of the active log 118 reaches a particular threshold and, similarly, when a particular subset reaches a particular size threshold. The size threshold for the active log 118 and the subsets can be measured in various units, such as bytes or counts of block identifiers. For example, the size threshold for each subset might be one megabyte or 1.25×2²⁰ block identifiers.

If the insertion unit 110 determines that a subset has reached the particular threshold, the operations depicted at stages C and D are performed. If the insertion unit 110 determines that the active log 118 has reached the particular threshold, the operations depicted at stages E, F, and G are performed.

At stages C and D, the insertion unit 110 notifies the sort unit 112 that a particular subset has reached the particular threshold and the sort unit 112 sorts the block identifiers in the particular subset.

At stage C, the insertion unit 110 notifies the sort unit 112 that subset #1 has reached the size threshold (four block identifiers, in this example) in response to appending a block identifier to the active log 118 that constitutes the fourth block identifier of subset #1. In this particular case, stage C would occur after the insertion unit 110 inserted block identifier 484 into the active log 118.

When the insertion unit 110 notifies the sort unit 112 that a particular subset has reached the particular threshold, the insertion unit 110 can identify the particular subset in a variety of ways. For example, the particular subset can be identified by the subset number (in this case, ‘1’). When the subsets are fixed size, the initial entry associated with the subset can be identified by multiplying the subset number by the fixed subset size (or similar technique adapted for a particular configuration). The particular subset can also be identified by indicating which entries in the active log 118 correspond to the subset. For example, subset #1 could be identified as entries four through seven (assuming zero-based numbering). The insertion unit 110 can also include a file pointer or other means to access the active log 118 with the notification.
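The arithmetic for locating a fixed-size subset can be written out directly. The helper name `subset_entry_range` is hypothetical; the calculation follows the multiplication described above with zero-based entry numbering.

```python
def subset_entry_range(subset_number, subset_size):
    """Return the (first, last) zero-based entry indexes of a fixed-size subset."""
    first = subset_number * subset_size
    last = first + subset_size - 1
    return first, last


# Subset #1 with a fixed subset size of four block identifiers.
first_entry, last_entry = subset_entry_range(1, 4)
```

With a subset size of four, subset #1 spans entries four through seven, matching the example in the text.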

At stage D, the sort unit 112 sorts the subset identified by the insertion unit 110 at stage C. In particular, the sort unit 112 sorts the block entries of the subset relative to each other, such that the block identifiers contained in the subset are in ascending order. The particular sorting technique used, such as a quicksort or mergesort algorithm, can vary depending on the storage system configuration, including the number of entries in a subset, whether the subset can be stored entirely in memory, etc.

The subset remains in the same position within the active log 118 after being sorted. In other words, the entries within the subset are merely reordered. For example, subset #1 is depicted in FIG. 1 as being unsorted in the active log 118, but sorted in the inactive log 120. Subset #1 in the inactive log 120 still comprises entries four through seven, as in the active log 118. It should be noted that while FIG. 1 depicts subset #1 as being unsorted in the active log 118 and sorted in the inactive log 120, the roles of active log and inactive log are independent from the sorting of subsets. Thus, after stage D, subset #1 can be sorted while the active log 118 remains the active log (similar to the depiction of subset #0).
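Sorting a subset in place can be sketched as follows, assuming the log is held as an in-memory list. The function name and the block identifiers are illustrative and are not the contents of FIG. 1; Python's built-in sort stands in for whichever sorting technique a particular configuration uses.

```python
def sort_subset_in_place(log, subset_number, subset_size):
    """Reorder one subset of the log without moving it or touching other subsets."""
    start = subset_number * subset_size
    # The last subset may be only partially filled.
    end = min(start + subset_size, len(log))
    log[start:end] = sorted(log[start:end])


# Illustrative log: subset #0 and subset #1 are full, subset #2 is partial.
log = [900, 42, 350, 779, 627, 395, 812, 484, 833, 642]
sort_subset_in_place(log, 1, 4)
```

After the call, entries four through seven hold the same four identifiers in ascending order, and every other entry is unchanged.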

At stages E, F, and G, the active log 118 becomes the inactive log 120, the insertion unit 110 notifies the sort unit 112 that the last subset of the (now) inactive log 120 should be sorted, and the sort unit 112 sorts the last subset of the inactive log 120.

At stage E, the block free unit 108 (or other component, such as the insertion unit 110), in response to determining that the active log 118 has reached a particular threshold, performs a log switch, making the active log 118 the inactive log 120. In effect, the role of the active log 118 is changed such that block identifiers are no longer appended to the active log 118, thus making the active log 118 “inactive”. Once the active log 118 is made the inactive log 120, a previous inactive log is made the new active log, to which new block identifiers are appended. Thus, the active log 118 becomes the inactive log 120 at stage E. The actual log switch is described in more detail below.

At stage F, the insertion unit 110 notifies the sort unit 112 that the last subset of the inactive log 120 is to be sorted. Similar to the operations described above at stage C, the insertion unit 110 can identify the particular subset in a variety of ways. Similarly, the insertion unit 110 can identify that the subset is for the inactive log 120 instead of an active log.

At stage G, the sort unit 112 sorts the last subset (subset #2) of the inactive log 120. To sort subset #2, the sort unit 112 can perform operations substantially similar to those described above at stage D. The sort unit 112 might also utilize different operations. For example, the active log 118 might have reached the particular threshold prior to subset #2 reaching the size threshold, as depicted here. Thus, subset #2 can include fewer block identifiers than the other subsets, which might make different sorting algorithms more advantageous. Regardless, after stage G, subset #2 is sorted in ascending order like the other subsets of the inactive log 120.

At stage H, the sort unit 112 notifies the sorted merge unit 114 that all subsets of the inactive log 120 are sorted. The notification can include a mechanism for the sorted merge unit 114 to access the inactive log 120, such as a file pointer. Including a file pointer or other mechanism to access, or identify, the inactive log 120 facilitates the use of multiple logs that alternate between the active and inactive roles. Thus, the sort unit 112 indicates to the sorted merge unit 114 which of the multiple logs is currently the inactive log.

At stage I, the sorted merge unit 114 merges the sorted subsets of the inactive log 120 and generates the sorted log 122. To generate the sorted log 122, the sorted merge unit 114 utilizes a modified heapsort. To utilize the modified heapsort, the sorted merge unit 114 generates a “min-heap” using the smallest block identifier of each subset in the inactive log 120. A min-heap is a binary heap in which a parent node is associated with a value that is less than all of the parent node's children. In the current example, the min-heap would consist of the block identifiers ‘42’, ‘395’, and ‘642’, with the block identifier ‘42’ being the root of the min-heap. As with a typical heapsort, the root element is removed (block identifier ‘42’) and is written as the first element of the sorted log 122. The entry in the inactive log 120 corresponding to the root element is removed from the inactive log 120 as well. A sift-up operation is performed on the min-heap, thus maintaining the min-heap properties. The next lowest block identifier from the subset associated with the removed root node is added to the min-heap as a new element. This process continues until all block identifiers are written to the sorted log 122. Because the entries in the inactive log 120 are removed as the corresponding block identifier is written to the sorted log 122, the inactive log 120 contains no block identifiers after the sorted log 122 is generated.
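The heap-based merge can be sketched with Python's standard `heapq` module, which maintains a binary min-heap on a list. This is an illustration of a k-way merge under the assumption that every subset is already sorted and fits in memory; it does not remove entries from the inactive log, and the function name and subset contents are hypothetical.

```python
import heapq


def merge_sorted_subsets(subsets):
    """Merge internally sorted subsets into one ascending list of block identifiers."""
    # Seed the min-heap with the smallest identifier of each non-empty subset.
    # Each heap item is (block identifier, subset index, position in subset).
    heap = [(subset[0], index, 0) for index, subset in enumerate(subsets) if subset]
    heapq.heapify(heap)

    sorted_log = []
    while heap:
        block_id, index, position = heapq.heappop(heap)  # remove the root
        sorted_log.append(block_id)
        position += 1
        if position < len(subsets[index]):
            # Add the next lowest identifier from the same subset.
            heapq.heappush(heap, (subsets[index][position], index, position))
    return sorted_log


# Illustrative subsets whose smallest identifiers are 42, 395, and 642.
subsets = [[42, 350, 779, 900], [395, 484, 627, 812], [642, 833]]
merged = merge_sorted_subsets(subsets)
```

Because the heap never holds more than one entry per subset, its size is bounded by the number of subsets rather than by the number of block identifiers.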

When removing the block identifiers from the inactive log 120, the sorted merge unit 114 performs a “hole punch”. A hole punch occurs when a particular block identifier in the inactive log 120 is removed without shifting the other block identifiers in the inactive log 120 to take the place of the removed block identifier. In other words, when a particular block identifier is removed from the inactive log 120, a hole is left in the inactive log 120.
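The idea of removing an entry without shifting its neighbors can be shown with an in-memory list. The sentinel `HOLE` and the function `punch_hole` are hypothetical names for this sketch; how holes are represented in a log stored on a storage device is not specified here.

```python
HOLE = None  # assumed marker for a removed entry


def punch_hole(log, position):
    """Remove the entry at a position, leaving a hole instead of shifting entries."""
    block_id = log[position]
    log[position] = HOLE
    return block_id


log = [42, 350, 627]
removed = punch_hole(log, 1)
```

The remaining entries keep their original positions, so any index or pointer into the log stays valid after the removal.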

At stage J, the sorted merge unit 114 notifies the free unit 116 that the sorted log 122 has been generated. The sorted merge unit 114 can identify the sorted log 122 and/or provide a mechanism for the free unit 116 to access the sorted log 122, such as a file pointer.

At stages K, L, and M, the free unit 116 iterates through the sorted log 122 and updates metadata associated with the block identifiers. FIG. 1 depicts the storage device 124 as including the metadata, which is represented by an active map 126 and a reference count map 128. As described above, the active map 126 is a bitmap in which each bit corresponds to a particular block. If the bit is set to ‘0’, the block is free; if the bit is set to ‘1’, the block is allocated. The reference count map 128 is also similar to that described above, in which groups of bits correspond to individual blocks and store the count of references associated with each respective block.

Stages K, L, and M depict a single iteration through the sorted log 122. Thus, in actual operation, the free unit 116 will typically repeatedly perform the operations depicted at stages K, L, and M until there are no more block identifiers in the sorted log 122.

At stage K, the free unit 116 reads a block identifier from the sorted log 122. The free unit 116 maintains a pointer to a current block identifier in the sorted log 122. The pointer is initialized to the first block identifier (‘42’ in this example). To read a block identifier from the sorted log 122, the free unit 116 reads the block identifier indicated by the pointer. The pointer is then updated to point to the next block identifier (‘350’ in this example).

At stage L, the free unit 116 reads data from the reference count map 128 corresponding to the block identifier read at stage K and decrements the appropriate value, then writes the data back to the reference count map 128. Because reads and writes to the storage device 124 occur on a per block basis, the free unit 116 reads an entire block of data. While a block of data might be four kilobytes in size, one or two bytes might be used to store the reference count for a particular block. Thus, the free unit 116 might read a significant amount of data in order to update a small portion of that data.

When the free unit 116 decrements the reference count, the free unit 116 also determines whether the decrement results in the reference count being zero. If the reference count is zero, the free unit 116 performs the operations depicted at stage M. If the reference count is not zero, the free unit 116 does not perform the operations depicted at stage M.

At stage M, the free unit 116 reads data from the active map 126 corresponding to the block identifier read at stage K and sets the appropriate bit to ‘0’, then writes the data back to the active map 126. As described above, a block is generally the smallest unit of storage that can be read or written to. Thus, even though only a single bit is changed to update the active map 126, an entire block is read and written in order to update a single bit.
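The loop formed by stages K, L, and M can be sketched as follows. For brevity, the reference count map and the active map are modeled as dictionaries keyed by block identifier rather than as on-disk blocks; the function name and the sample values are assumptions for illustration.

```python
def free_blocks(sorted_log, reference_counts, active_map):
    """Iterate over a sorted log, decrement reference counts, and free blocks.

    Returns the block identifiers whose active map bit was cleared.
    """
    freed = []
    for block_id in sorted_log:              # stage K: read the next identifier
        reference_counts[block_id] -= 1      # stage L: decrement the count
        if reference_counts[block_id] == 0:
            active_map[block_id] = 0         # stage M: mark the block free
            freed.append(block_id)
    return freed


# Illustrative metadata: block 350 is referenced by two storage objects.
reference_counts = {42: 1, 350: 2, 627: 1}
active_map = {42: 1, 350: 1, 627: 1}
freed = free_blocks([42, 350, 627], reference_counts, active_map)
```

In this example block 350 remains allocated because one reference to it still exists after the decrement.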

Although the stages described above are depicted as occurring sequentially, at least some of the stages can occur in parallel. For example, once a log switch occurs (e.g., at stage E), the operations depicted at stage B can be performed with the new active log while the operations at stage G are performed using the inactive log 120. Further, the sort unit 112 might be capable of performing multiple sorts in parallel (e.g., using multiple threads or processes), allowing the operations of stages D and G to occur in parallel.

The sorted log 122 and the subsets of the active log 118 and inactive log 120 are described and depicted as being sorted in ascending order. The illustrations herein assume that blocks are identified based on their order on the storage device 124. For example, block 1 is sequentially followed by block 2, block 2 is sequentially followed by block 3, block n is sequentially followed by block n+1, etc. In storage system configurations in which this property is not true, the sorting can occur according to a different ordering.

Additional un-depicted operations may be performed to facilitate the operations depicted in FIG. 1. For example, the block free unit 108 might not be capable of performing some operations in parallel. In other words, some operations might be mutually exclusive. For example, consider the merging of the subsets of the inactive log 120 at stage I and the freeing of the blocks at stages K through M. Actively generating the sorted log 122 while also removing block identifiers might result in unintended scenarios. For example, if the block identifiers are removed faster than they are added by the sorted merge unit 114, the free unit 116 might reach the end of the sorted log 122 and stop, even though the entire sorted log 122 has not been generated. In this, and similar scenarios, access to particular components or resources can be controlled by a state machine. For example, if the sorted merge unit 114 is actively generating the sorted log 122, the block free unit 108 might be set to a “MERGE” state. The free unit 116 can delay any attempt to free blocks while the state is set to “MERGE”. Similarly, when the free unit 116 is freeing blocks, the block free unit 108 might be set to a “FREE” state. The sorted merge unit 114 would delay the generation of the sorted log 122 while the state is set to “FREE”. When neither operation is occurring, the state might be set to an “IDLE” state.
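A state machine of this kind can be sketched in a few lines. The state names come from the description above; the class `BlockFreeState` and its methods `try_begin` and `finish` are hypothetical, and the sketch is single-threaded, so a real implementation would also need appropriate locking.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    MERGE = "MERGE"
    FREE = "FREE"


class BlockFreeState:
    """Hypothetical guard that keeps merging and freeing mutually exclusive."""

    def __init__(self):
        self.state = State.IDLE

    def try_begin(self, wanted):
        """Enter MERGE or FREE only from IDLE; otherwise the caller must delay."""
        if self.state is not State.IDLE:
            return False
        self.state = wanted
        return True

    def finish(self):
        """Return to IDLE once the current operation completes."""
        self.state = State.IDLE
```

A caller that receives False from `try_begin` would retry later, which corresponds to the free unit or the sorted merge unit delaying its operation.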

The sorting of the block identifiers is, effectively, the application of a particular measure of spatial locality. Consider, for example, a linear representation of data in which the blocks of data are identified by sequential integer block identifiers (e.g., the first block is identified by block identifier ‘1’, the second block is identified by block identifier ‘2’, etc.). The spatial locality between a set of blocks can be measured by the difference between the blocks' respective block identifiers. Thus, for example, blocks ‘15’ and ‘18’ have a greater spatial locality than blocks ‘20’ and ‘30’ (differences between the block identifiers being three and ten, respectively). Sorting the block identifiers in ascending (or descending) order effectively groups the block identifiers by the particular measure of spatial locality, minimizing the distance between the blocks associated with the block identifiers in the block logs. The ability to represent data on storage devices, such as hard disks, in a linear manner is a particular characteristic of the storage devices themselves which might not be shared between all storage devices. Thus, sorting a block log in an ascending order based on the block identifier might not be the most appropriate application of a particular measure of spatial locality. For example, a storage device might use a mechanism that benefits from spatial locality in two dimensions. In such a storage device, the particular measure of spatial locality might be the distance between blocks of data in two dimensions instead of a single dimension. Thus, the particular technique used to sort the block identifiers can vary based on a variety of factors, including the particular measure of spatial locality appropriate to the storage system.

Additionally, the sorting technique used can be combined with other techniques, such as modular arithmetic. For example, the set of available block identifiers can be divided into ranges, such as block identifiers ‘0’-‘999’, ‘1000’-‘1999’, etc. The ranges of block identifiers can be identified based on a range identifier (e.g., ‘0’ for block identifiers ‘0’-‘999’, ‘1’ for block identifiers ‘1000’-‘1999’, etc.). The ranges of block identifiers can then be grouped based on the associated identifier. For example, to group every other range (e.g., grouping ranges ‘0’, ‘2’, ‘4’, etc. into a first group and grouping ranges ‘1’, ‘3’, ‘5’, etc. into a second group), the range identifier is divided by two and ranges of block identifiers are grouped based on the remainder. Similarly, to group every four ranges, the range identifier is divided by four instead of two.
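The range grouping reduces to an integer division followed by a remainder. The function name `range_group` and its parameter names are hypothetical; the range size of 1,000 and the group counts of two and four follow the example above.

```python
def range_group(block_id, range_size=1000, num_groups=2):
    """Return the group of the range that contains a block identifier."""
    range_id = block_id // range_size   # '0' for 0-999, '1' for 1000-1999, etc.
    return range_id % num_groups        # remainder selects the group


# Ranges '0' and '2' fall in the same group; range '1' falls in the other.
groups = [range_group(block_id) for block_id in (500, 1500, 2500)]
```

One possible use, offered as an assumption rather than a described technique, is to sort block identifiers by the pair (group, block identifier) so that all identifiers in a group are processed together.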

It should be noted that the sorting techniques employed during the process depicted in FIG. 1 result in particular characteristics that might not exist if a block log were not sorted until the block log reached a particular threshold. For example, if a block log with a large number of block identifiers were sorted at once, the resulting impact to performance of a controller could be significant, both by using processor cycles as well as increasing the number of read and write operations occurring on the one or more storage device(s) on which the block log is stored. However, by sorting subsets of the block identifiers individually, the cost of sorting an entire block log is amortized over a period of time, resulting in more predictable performance (or a less apparent performance impact) than sorting an entire block log at once. While an additional sort is used to merge the subsets together to form a sorted log, the resulting increase in overhead is generally less apparent to storage system clients than might occur if an entire block log were sorted at once.

While some examples of commands that can result in blocks being freed are discussed above, additional examples might be useful to further illustrate the subject matter. Consider at least one difference between a file system that updates data in place and a file system that uses a write-anywhere mechanism. When a controller writes data associated with a particular file to a file system that uses in-place updates, the controller writes the data to the same blocks that the particular file is already associated with. For example, assume that file A is stored at blocks 100-200. If a controller receives a command indicating that data for file A should be written, the controller writes the data to blocks 100-200.

A file system that uses a write-anywhere mechanism, on the other hand, functions differently. When a controller writes data associated with a particular file to a file system that uses a write-anywhere mechanism, the data is written to a set of new blocks on a storage device. Thus, for example, if file A is stored at blocks 100-200, the controller might write the data to blocks 500-600. Once the data is written, blocks 100-200 are freed. Thus, when the file system uses a write-anywhere mechanism, each write command generally involves freeing one or more blocks.

Further, commands issued by clients are not the only way that blocks can be freed. A controller can perform various management operations, including moving data between blocks on a storage device, which can result in blocks being freed. Further, other components of a storage device might issue commands (such as replication commands between nodes in a clustered storage system). In other words, there are a variety of possible reasons why a block may be freed beyond commands received from clients.

Read-Ahead Example Illustration

FIG. 2 is a conceptual diagram depicting the use of a read-ahead mechanism facilitated by a sorted list of block identifiers. FIG. 2 depicts two states 200A and 200B of a storage device 202, memory 204, and a log entry pointer 206 when using a read-ahead mechanism. The storage device 202 includes a block log 203. The memory 204, at state 200A, includes a first in-memory subset 205A of the block log 203. The memory 204, at state 200B, includes a second in-memory subset 205B of the block log 203. Each block identifier in the block log 203 is referred to as an “entry”.

State 200A depicts the state of the storage device 202, memory 204, and log entry pointer 206 after stages A and B. State 200B depicts the state of the storage device 202, memory 204, and log entry pointer 206 after two block identifiers are iterated over and after the completion of stage C.

At stage A, the first four block identifiers of the block log 203 are read from the storage device 202 and stored in the memory 204. These four block identifiers, ‘42’, ‘350’, ‘627’, and ‘779’, become the first in-memory subset 205A. Typically, the number of block identifiers read from the storage device 202 and stored in the memory 204 is greater than one. However, the particular number of entries read at stage A can be based on a variety of factors, such as the performance of the storage device 202, the performance of a computing system that includes the memory 204, etc.

At stage B, the log entry pointer 206 is initialized to the first block identifier entry and iteration over the block identifiers begins. The log entry pointer 206 can be a variable or other construct that includes an address that points to the location in the memory 204 that includes the first block identifier, ‘42’, or an offset value used to specify a particular location in memory relative to a base address that points to the beginning of the first in-memory subset 205A, etc. Initializing the log entry pointer 206 includes setting the log entry pointer 206 to the value that allows the first entry in the first in-memory subset 205A to be accessed.

To iterate over the block identifiers, the block identifier referenced by the log entry pointer 206 is read from the memory 204 and used to free the block associated with the block identifier. After the block identifier is read, the log entry pointer 206 is set to point to the next entry in the first in-memory subset 205A. Setting the log entry pointer 206 to point to the next entry can include incrementing the log entry pointer 206, adding a value to the log entry pointer 206, etc.

Stage C occurs when the log entry pointer 206 nears the end of the current in-memory subset of the block log 203. In the example depicted in FIG. 2, when the log entry pointer 206 is set to point to the third entry of the subset 205A, stage C occurs.

At stage C, the next four block identifiers of the block log 203 are read from the storage device 202 and stored in the memory 204. These four block identifiers, ‘484’, ‘627’, ‘642’, and ‘748’, combine with the first in-memory subset 205A to become the second in-memory subset 205B. As discussed above, the particular number of block identifiers read from the storage device 202 and stored in the memory 204 can vary.

The general process described in stages A-C is performed until all entries in the block log 203 are read into memory. Once all entries in the block log 203 are read into memory, the in-memory entries are iterated over until the end of the in-memory block log is reached.

Reading block identifiers from the block log 203 into the memory 204 prior to the log entry pointer reaching the end of the particular in-memory subset prevents the iteration from halting while additional block identifiers are read into the memory 204. In other words, if the log entry pointer 206 were to reach the last entry of the first in-memory subset 205A before the second set of four entries were read from the storage device 202 and stored into the memory 204, the iterative process would halt while the four entries were read from the storage device 202. On the other hand, if the second set of four entries is read from the storage device 202 and stored in the memory 204 prior to the log entry pointer 206 reaching the last entry of the first in-memory subset 205A, the iterative process can continue without temporarily halting.
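
Stages A through C can be illustrated with a minimal sketch. It is a simplified, synchronous model: a real system would presumably issue the read-ahead asynchronously so that iteration and storage reads overlap. The function name, the in-memory list standing in for the block log on the storage device, and the parameter defaults (a batch of four entries, refill when two entries remain) are assumptions chosen to mirror the example of FIG. 2.

```python
from collections import deque


def iterate_with_read_ahead(block_log, batch_size=4, refill_when_remaining=2):
    """Yield entries of block_log, fetching the next batch before the buffer drains."""
    buffer = deque()   # stands in for the in-memory subset
    next_index = 0     # position of the next unread entry in the on-storage log

    def read_batch():
        nonlocal next_index
        batch = block_log[next_index:next_index + batch_size]
        next_index += len(batch)
        buffer.extend(batch)

    read_batch()  # stage A: read the first batch into memory
    while buffer:
        # Stage C: refill while entries are still available to iterate over.
        if len(buffer) <= refill_when_remaining and next_index < len(block_log):
            read_batch()
        yield buffer.popleft()  # stage B and onward: consume the current entry
```

Because the refill happens while two unread entries remain in the buffer, the consumer never observes an empty buffer until the whole log has been read.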

The timing associated with reading additional entries from the storage device 202 can be determined by estimating (or calculating) the amount of time it takes to read the entries from the storage device 202 and store them in the memory 204. For example, if it takes fifty milliseconds to read a set of entries from the storage device 202 and store them in the memory 204, the read can be initiated approximately fifty milliseconds before the log entry pointer 206 reaches the end of the current in-memory subset. Thus, various characteristics can factor into determining the timing of when additional entries are read from the storage device 202, including the rate at which the entries are being iterated over (which depends on, e.g., processor speed) and the performance of the storage device 202 and the memory 204.
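
One way to turn the timing estimate into a trigger point is to convert the read latency into a number of entries. The sketch below assumes a steady iteration rate; the function name and parameters are hypothetical and serve only to illustrate the arithmetic.

```python
def read_ahead_trigger(read_latency_ms: int, entries_per_second: int) -> int:
    """Return how many unread in-memory entries should remain when the next read starts.

    The result is the number of entries consumed during one storage read,
    rounded up so the read completes before the buffer is exhausted.
    """
    consumed_times_1000 = read_latency_ms * entries_per_second
    return -(-consumed_times_1000 // 1000)  # integer ceiling division
```

For a fifty-millisecond read and an iteration rate of 1000 entries per second, the read would be initiated when 50 unread entries remain in memory.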

It should be noted that while the entries of the block log 203 and the in-memory subsets are depicted as being arranged linearly, in practice they may be located in noncontiguous locations. For example, a virtual memory system might make the second in-memory subset 205B appear to be in contiguous memory locations while some portions of the second in-memory subset 205B are actually stored in physical memory locations that are not adjacent to other portions of the second in-memory subset 205B.

Further, while FIG. 2 depicts the in-memory subsets as containing all previous entries that have been iterated over, some configurations can allow the iterated-over entries to be replaced or overwritten by additional data. For example, the in-memory subsets might be stored in the memory 204 as a ring buffer. Thus, old entries in the memory 204 might be overwritten by new entries, decreasing the amount of memory used by the in-memory subsets.

Log Switch Example Illustration

As described above, once an active log reaches a certain size threshold, the active log is made “inactive” and replaced by a new, empty active log. This functionality can be implemented using two block logs. When initialized, both block logs are empty, and a first of the block logs is designated as the active log. An insertion unit then begins appending new block identifiers to the active log. Once the first block log reaches the size threshold described above, the first block log is designated as the inactive log and the second of the block logs is designated as the active log. The insertion unit then appends new block identifiers to the second block log.

Once the first block log is designated as the inactive log, the first block log is processed to generate the sorted log. As entries are added to the sorted log, the corresponding entries are removed from the first block log until the first block log is empty. Once the second block log reaches the size threshold, the log switch is performed again. The block logs can be implemented as log files on a storage device.

FIG. 3 is a conceptual diagram depicting the performance of a log switch between two different block logs. FIG. 3 depicts a first block log 302, second block log 304, and insertion unit 306. Insertion unit 306 includes a pointer 308 to the insertion target location of the current active log file. State 300A depicts the state of the first block log 302, second block log 304, and insertion unit 306 before the log switch is performed but after block identifiers have been inserted into the first block log 302. State 300B depicts the state of the first block log 302, second block log 304, and insertion unit 306 after the log switch is performed. FIG. 3 also depicts two possible mechanisms, block log management mechanisms #1 and #2, for tracking which block log is the active log. The mechanisms are not mutually exclusive, and other mechanisms might be used.

Prior to the log switch, as depicted at state 300A, the first block log 302 is the active log. The second block log 304 is the inactive log. Block log management mechanism #1 uses a status variable (“BlockLog1_Role”) to indicate the particular role of the first block log 302. Because the first block log 302 is the active log, the status variable “BlockLog1_Role” is set to “ACTIVE”. Block log management mechanism #2 uses a pointer (“ActiveLogPointer”) to indicate which block log is the active log. Because the first block log 302 is the active log, the pointer “ActiveLogPointer” is set to the pointer to the first block log (“BlockLog1Ptr”). Additionally, the next insertion target 308 is at the tail of the first block log 302.

After the log switch, as depicted at state 300B, the second block log 304 is the active log and the first block log 302 is the inactive log. Accordingly, for block log management mechanism #1, the status variable “BlockLog1_Role” is set to “INACTIVE”. Similarly, for block log management mechanism #2, pointer “ActiveLogPointer” is set to the pointer to the second block log (“BlockLog2Ptr”). Additionally, the next insertion target 308 is set to the tail of the second block log 304.

One characteristic of both block log management mechanisms is the use of a single indicator (“BlockLog1_Role” or “ActiveLogPointer”) to indicate which of the block logs is the active log. The indication that one of the two block logs is the active log means that, by default, the other block log is the inactive log. The use of a single indicator allows the log switch to occur atomically (or nearly atomically, as other operations may be performed depending on the specific configuration).
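
The single-indicator approach can be sketched as follows. This is a hypothetical model loosely analogous to mechanism #2: the class name, the use of Python lists as block logs, and the integer index standing in for “ActiveLogPointer” are all assumptions made for illustration.

```python
class BlockLogPair:
    """Two block logs whose roles are determined by one indicator."""

    def __init__(self):
        self.logs = ([], [])   # first and second block logs
        self.active_index = 0  # the single indicator: which log is active

    @property
    def active(self):
        return self.logs[self.active_index]

    @property
    def inactive(self):
        # The inactive log is implied: it is whichever log is not active.
        return self.logs[1 - self.active_index]

    def append(self, block_id):
        """Append a block identifier to the tail of the active log."""
        self.active.append(block_id)

    def switch(self):
        """Perform the log switch by updating only the single indicator."""
        self.active_index = 1 - self.active_index
```

Because only one value changes during the switch, there is no intermediate state in which both logs (or neither) are considered active.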

Example Operations for Freeing Blocks

FIGS. 4 and 5 are flowcharts depicting example operations for appending block identifiers to a block log, generating a sorted log, and freeing the blocks identified in the sorted log.

FIG. 4 is a flowchart depicting example operations for appending block identifiers to a block log, performing a log switch, and sorting subsets of the block log. The example operations depicted in FIG. 4 can be performed by a block free unit, such as the block free unit 108 depicted in FIG. 1, or another component.

At block 400, a block free unit receives an indication that a block is to be freed. The indication can come from a client, another component communicatively coupled to the block free unit, etc. The indication includes at least a block identifier that identifies the block to be freed. After the block free unit receives the indication that the block is to be freed, control then flows to block 402.

At block 402, the block free unit appends the block identifier of the block to be freed (hereinafter “block identifier”) to a current subset of an active log. To facilitate appending the block identifier, the block free unit maintains a pointer to the active log. The pointer can point to the specific location in the active log that the block identifier should be written to or point to the beginning of the active log. If the pointer points to the beginning of the active log, the block free unit can maintain an offset that indicates where in the active log, relative to the pointer, the block identifier should be written. In other words, the pointer might point to entry ‘0’ of the active log, while the offset specifies that the block identifier should be written to entry ‘20’.

As described above, the active log is divided into subsets. Once a subset reaches a maximum subset size, the next block identifier appended begins the next subset. Thus, the block free unit might also maintain an indication of the size of the current subset. The various data used by the block free unit, such as a pointer, offset, and/or subset size count, can be updated when the block free unit appends the block identifier to the active log. After the block free unit appends the block identifier to the current subset of the active log, control then flows to block 404.

At block 404, the block free unit determines whether the active log size is greater than a threshold. To determine whether the active log size is greater than the threshold, the block free unit compares an indication of the active log size with the threshold. The block free unit can maintain the indication of the active log size or might query a file system or storage device to determine the active log size. The threshold can be preconfigured, determined dynamically, or a combination thereof. For example, the threshold might be a percentage of available space on one or more storage devices. The particular percentage might be preconfigured while the actual threshold is dynamically determined by determining the amount of available space and multiplying the amount of available space by the particular percentage. The block free unit can then compare the threshold with the size of the active log. If the block free unit determines that the active log size is not greater than the threshold, control then flows to block 406. If the block free unit determines that the active log size is greater than the threshold, control then flows to block 410.
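
The combination of a preconfigured percentage and a dynamically determined threshold can be sketched as below. The function names, the byte units, and the default of five percent are assumptions for illustration only.

```python
def log_size_threshold(available_bytes: int, percent: int) -> int:
    """Return the threshold: a preconfigured percentage of the available space."""
    return available_bytes * percent // 100


def needs_log_switch(active_log_bytes: int, available_bytes: int, percent: int = 5) -> bool:
    """Return True if the active log size is greater than the threshold (block 404)."""
    return active_log_bytes > log_size_threshold(available_bytes, percent)
```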

At block 406, the block free unit determines whether the current subset size is equal to a size threshold (i.e., the maximum subset size). To determine whether the current subset size is equal to the threshold, the block free unit compares an indication of the current subset size to the threshold. The current subset size can be maintained and updated as block identifiers are inserted into the active log. In some configurations, the block free unit might not maintain the actual size of the current subset. For example, as described above, the block free unit might maintain an offset that indicates where in the active log a block identifier should be appended. In such an instance, the block free unit can determine whether the current subset size is equal to the threshold by performing a modulo operation using the offset and the maximum subset size. For example, assuming a zero-based offset, each time the remainder of the offset divided by the maximum subset size is zero (after the first block identifier is inserted), the current subset size is equal to the threshold. If the block free unit determines that the current subset size is equal to the threshold, control then flows to block 408. If the block free unit determines that the current subset size is not equal to the threshold, the process ends.
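
The modulo-based check can be sketched as follows, under the assumption that the offset is zero-based and indicates where the next block identifier will be appended. The function and parameter names are hypothetical.

```python
def subset_is_full(next_offset: int, max_subset_size: int) -> bool:
    """Return True if the most recent append completed a subset.

    next_offset is the zero-based position at which the next block
    identifier will be written; it equals the number of identifiers
    appended so far. An offset of zero means nothing has been appended.
    """
    return next_offset > 0 and next_offset % max_subset_size == 0
```

With a maximum subset size of four, the check succeeds when the offset is 4, 8, 12, and so on, and fails for every other offset, including the initial offset of zero.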

At block 408, the block free unit sorts the current subset of block identifiers. To sort the current subset of block identifiers, the block free unit reads the block identifiers in the active log that correspond to the current subset and performs one or more sort operations on the block identifiers. The particular sort operation(s) can vary. For example, the specific sort operation(s) can change based on the particular sorting algorithm, such as quicksort or mergesort, used to sort the block identifiers. The sorted block identifiers are written back to the active log in the same set of entries from which they were read. After the block free unit sorts the current subset of block identifiers, the process ends.
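
Sorting a subset and writing it back to the same entries can be sketched as below. The active log is modeled as a Python list and the built-in sort stands in for whichever sorting algorithm a configuration uses; both are assumptions for illustration.

```python
def sort_subset_in_place(active_log: list, subset_start: int, subset_size: int) -> None:
    """Sort one subset of the active log, leaving all other entries untouched."""
    subset_end = subset_start + subset_size
    # Read the subset, sort it, and write it back to the same entries.
    active_log[subset_start:subset_end] = sorted(active_log[subset_start:subset_end])
```

Only the entries of the given subset are reordered, so previously sorted subsets and any entries appended afterward are unaffected.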

Control flowed to block 410 if it was determined, at block 404, that the active log size is greater than a threshold. At block 410, the block free unit performs a log switch between the active log and an inactive log. To perform the log switch, the block free unit updates one or more indications to indicate that the inactive log is the new active log (or vice versa). For example, as described above, the block free unit might set a status variable indicating that a particular block log is now the active log, thus also indicating that a second block log is now the inactive log. After the block free unit performs the log switch between the active log and the inactive log, control then flows to block 412.

At block 412, the block free unit sorts the last subset of block identifiers of the inactive log. To sort the last subset of block identifiers of the inactive log, the block free unit can perform operations substantially similar to those described at block 408. After the block free unit sorts the last subset of block identifiers of the inactive log, control then flows to block 414.

At block 414, the block free unit generates a sorted log by merging all subsets of the inactive log into a single sorted log. The block free unit can generate the sorted log by utilizing a modified heapsort algorithm. As each block identifier from the inactive log is added to the sorted log, the block identifier is removed from the inactive log. Thus, when the generation of the sorted log is completed, the inactive log contains no more block identifiers. The use of a heapsort algorithm to merge the subsets of the inactive log into the sorted log is described in greater detail below. After the block free unit generates the sorted log by merging all subsets of the inactive log into a single sorted log, control then flows to block 500 of FIG. 5.

FIG. 5 is a flowchart depicting example operations for updating metadata associated with freed blocks indicated in a sorted log. The example operations depicted in FIG. 5 can be performed by a block free unit, such as the block free unit 108 depicted in FIG. 1, or another component.

Control flowed to block 500 after the block free unit generated, at block 414 of FIG. 4, the sorted log by merging all subsets of the inactive log into the single sorted log. At block 500, the block free unit indicates that blocks associated with the sorted log, the reference count map, and the active map should be read from storage. For example, the block free unit might determine block identifiers associated with the blocks at which the sorted log, reference count map, and active map are stored. The block free unit might then send the determined block identifiers to another process or component, which can initiate the reading of the data associated with the sorted log, reference count map, and active map from one or more storage devices. This allows the process of reading the associated data into memory to begin prior to actually accessing the data, thus mitigating the amount of time the block free unit waits for data to load. As another example, the block free unit might send identifiers for the sorted log, reference count map, and active map instead of individual block identifiers. After the block free unit indicates that blocks associated with the sorted log, the reference count map, and the active map should be read from storage, control then flows to block 502.

At block 502, the block free unit begins a loop in which the sorted log is processed. During the initial pass through block 502, the block free unit initializes a current block identifier pointer to refer to the first entry in the sorted log. On subsequent passes through block 502, the block free unit updates the current block identifier pointer to refer to the next entry in the sorted log. In some storage systems, the current block identifier can be identified using a pointer to the sorted log (e.g., a pointer to the first entry in the sorted log) and an offset value that indicates the specific entry of the sorted log that is the current block identifier. After the block free unit initializes or updates the current block identifier pointer, control then flows to block 504.

At block 504, the block free unit reads the current block identifier from the sorted log. The current block identifier is indicated by the current block identifier pointer that was initialized or updated at block 502. The current block identifier pointer indicates the location in memory at which the current block identifier is stored (the current block identifier having been stored in memory as a result of the read indicated at block 500). Thus, the block free unit reads the memory location indicated by the current block identifier pointer. After the block free unit reads the current block identifier from the sorted log, control then flows to block 506.

At block 506, the block free unit reads, from a reference count map, reference count data corresponding to the current block identifier. As described above, the reference count map indicates the reference count for each block, with each location in the reference count map corresponding to a particular block. Thus, if the reference count is stored as a byte, the first byte corresponds to the first block, the second byte corresponds to the second block, the nth byte corresponds to the nth block, etc. Thus, the block free unit reads the particular portion of the reference count map that corresponds to the current block identifier.

The reference count map, however, is generally subject to the same input/output configuration as other data. Thus, instead of reading the particular byte corresponding to the current block identifier, the block free unit reads a block of data that includes the corresponding byte. For example, if the current block identifier is ‘150’ and the block size is 100 bytes, the block free unit actually reads bytes 100 through 199 in order to access the single byte for block ‘150’. After the block free unit reads the reference count data corresponding to the block identifier, control then flows to block 508.
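
The byte range that must be read for a given block identifier follows from integer division. The sketch below assumes one reference count byte per block and a zero-based byte numbering, consistent with the ‘150’ example; the function name and parameters are hypothetical.

```python
def containing_block_range(block_id: int, io_block_size: int = 100,
                           bytes_per_count: int = 1) -> tuple:
    """Return the (first, last) byte offsets of the I/O block holding block_id's count."""
    byte_offset = block_id * bytes_per_count
    first = (byte_offset // io_block_size) * io_block_size
    return first, first + io_block_size - 1
```

For block identifier ‘150’ and a block size of 100 bytes, the function returns bytes 100 through 199, matching the example above.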

At block 508, the block free unit decrements the reference count associated with the current block identifier. In particular, the block free unit decrements the reference count associated with the current block identifier by one, indicating that one of the references to the particular block identified by the current block identifier has been freed. After the block free unit decrements the reference count associated with the current block identifier, control then flows to block 510.

At block 510, the block free unit writes the reference count data corresponding to the current block identifier back to the reference count map. In other words, in decrementing the reference count at block 508, the block free unit updates the data read from the reference count map at block 506. The block free unit now stores the updated data back to the reference count map. To do so, the block free unit can write the data to the same location from which the reference count data was read. After the block free unit writes the reference count data corresponding to the current block identifier back to the reference count map, control then flows to block 512.

At block 512, the block free unit determines whether the reference count for the block identified by the current block identifier is equal to zero. Determining whether the reference count for the block identified by the current block identifier is equal to zero allows the block free unit to determine whether the block should be indicated as free. In other words, if there are still one or more references referring to the block, the block is not truly freed. Thus, if the reference count is not zero, the block free unit need not continue to update any metadata that indicates whether the block is actually free (i.e., available to be allocated). If the block free unit determines that the reference count for the block identified by the current block identifier is equal to zero, control then flows to block 514. If the block free unit determines that the reference count for the block identified by the current block identifier is not equal to zero, the active map operations at blocks 514 through 518 are skipped and control then flows to block 520.
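
The decrement at block 508 and the zero check at block 512 can be sketched together. The reference count map is modeled as a bytearray with one byte per block; that representation, the function name, and the error raised for an already-zero count are assumptions for illustration.

```python
def release_reference(ref_counts: bytearray, block_id: int) -> bool:
    """Decrement the reference count for block_id; return True if it reached zero."""
    if ref_counts[block_id] == 0:
        # Assumed handling: a block with no references should not be freed again.
        raise ValueError(f"block {block_id} has no outstanding references")
    ref_counts[block_id] -= 1
    return ref_counts[block_id] == 0
```

A return value of True corresponds to the case in which the active map must also be updated; False corresponds to the case in which the block is still referenced and no further metadata needs to change.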

At block 514, the block free unit reads active map data corresponding to the current block identifier from an active map. As described above, the active map is a bitmap in which each bit corresponds to a respective block. If a bit corresponding to a particular block is set to ‘0’, the particular block is free. Because the block free unit determined that the reference count for the block corresponding to the current block identifier is equal to zero, the block free unit can update the active map data to indicate that the block is free.

The block free unit can read the active map data in a manner substantially similar to that used to read the reference count data at block 506. Similarly, the block free unit generally does not read a single bit, but reads a block of data that includes the bit for the block corresponding to the current block identifier. After the block free unit reads the active map data corresponding to the current block identifier from the active map, control then flows to block 516.

At block 516, the block free unit updates the bit in the active map data corresponding to the current block identifier. The block free unit can update the bit in various ways depending on the configuration. For example, the block free unit might explicitly set the bit corresponding to the current block identifier to a particular value (‘0’ in this example). As another example, the block free unit might apply a bit mask to the active map data that results in the changing of the individual bit to the appropriate value. More particularly, assume that the active map data is ‘011010’ in binary and that the third bit is the particular bit corresponding to the current block identifier. Performing a bitwise-AND operation using the bitmask ‘110111’ results in updated active map data ‘010010’. Thus, the bit corresponding to the current block identifier is set to ‘0’. After the block free unit updates the bit in the active map data corresponding to the current block identifier, control then flows to block 518.
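
The bit mask example can be reproduced with a short sketch. The bit position is counted from the most significant bit, starting at one, to match the “third bit” of the example; the function name and that numbering convention are assumptions.

```python
def clear_bit(active_map_data: int, bit_position: int, width: int) -> int:
    """Clear one bit of active_map_data using a bitwise-AND mask.

    bit_position is 1-based, counted from the most significant of `width` bits.
    """
    all_ones = (1 << width) - 1
    mask = all_ones & ~(1 << (width - bit_position))  # e.g. 0b110111 for position 3 of 6
    return active_map_data & mask
```

Applied to the active map data ‘011010’ with the third of six bits selected, the mask evaluates to ‘110111’ and the result is ‘010010’.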

At block 518, the block free unit writes the active map data corresponding to the current block identifier back to the active map. The block free unit can write the active map data back to the active map using operations substantially similar to those used to write the reference count data back to the reference count map at block 510. After the block free unit writes the active map data corresponding to the current block identifier back to the active map, control then flows to block 520.

At block 520, the block free unit determines whether the current block identifier is the last block identifier in the sorted log. The block free unit can determine whether the current block identifier is the last block identifier in the sorted log by determining the number of entries in the sorted log. The block free unit can then compare the current block identifier pointer with the memory location corresponding to the last entry in the sorted log. If the current block identifier pointer refers to the memory location corresponding to the last entry in the sorted log, the current block identifier is the last block identifier in the sorted log. If the block free unit determines that the current block identifier is the last block identifier in the sorted log, control then flows to block 522. If the block free unit determines that the current block identifier is not the last block identifier in the sorted log, control then flows back to block 502.

At block 522, the loop in which the sorted log is processed ends and the operations depicted at blocks 502 through 518 end.

Example Operations for Generating a Sorted Log

As described above, the block free unit can utilize a modified heapsort algorithm to merge the subsets of block identifiers in the inactive log into a single sorted log. In a typical heapsort algorithm, all elements of a list of elements are used to create a binary heap. The nodes of the binary heap are ordered relative to their children based on either a less-than-or-equal-to or greater-than-or-equal-to relationship (corresponding to a min-heap or max-heap, respectively). For example, if ordered based on a less-than-or-equal-to relationship, each parent node is less than or equal to its child nodes. To create the sorted list, the root node is removed and added to the list. The root node is then replaced by the last leaf node, which is sifted down until the particular order property is restored.

Instead of adding all block identifiers in the inactive log to the binary heap initially, the block free unit adds the first block identifier of each subset to the binary heap. As root nodes are removed to generate the sorted log, the block free unit adds the next block identifier from the same subset associated with the removed block identifier. In other words, if the root node contains the fifth block identifier from the sixth subset of block identifiers, the next node added to the binary heap is the sixth block identifier from the sixth subset of block identifiers.
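
The modified heapsort amounts to a k-way merge of already-sorted subsets. The sketch below uses Python's standard heapq module as the min-heap rather than a hand-built binary heap; the function name and the list-of-lists model of the inactive log are assumptions for illustration.

```python
import heapq


def merge_sorted_subsets(subsets):
    """Merge individually sorted subsets into one sorted log using a min-heap."""
    heap = []
    # Seed the heap with the first block identifier of each non-empty subset.
    for subset_index, subset in enumerate(subsets):
        if subset:
            heapq.heappush(heap, (subset[0], subset_index, 0))

    sorted_log = []
    while heap:
        # Remove the root (the smallest identifier currently in the heap).
        block_id, subset_index, position = heapq.heappop(heap)
        sorted_log.append(block_id)
        # Add the next identifier from the same subset as the removed root.
        next_position = position + 1
        if next_position < len(subsets[subset_index]):
            next_id = subsets[subset_index][next_position]
            heapq.heappush(heap, (next_id, subset_index, next_position))
    return sorted_log
```

The heap never holds more than one entry per subset, so its size is bounded by the number of subsets rather than by the number of block identifiers in the inactive log.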

FIG. 6 is a flowchart depicting example operations for generating a sorted log by merging multiple sorted subsets of block identifiers. The example operations depicted in FIG. 6 can be performed by a block free unit, such as the block free unit 108 depicted in FIG. 1, or another component. The binary heap used in the operations depicted in FIG. 6 is a min-heap, in which each child node is greater than or equal to the corresponding parent node, resulting in the root node being the minimum value in the binary heap.

At block 600, a block free unit begins a loop in which a binary heap is initialized with the first block identifier from each subset of block identifiers in an inactive log. During the loop, the block free unit iterates through the sorted subsets of block identifiers in the inactive log. On the initial pass through block 600, the block free unit initializes a value indicating a current subset. The current subset can be the first subset of block identifiers in the inactive log. However, the subsets can be iterated over in any order. As such, the initial current subset need not be the first subset of block identifiers in the inactive log. During subsequent passes through block 600, the block free unit updates the current subset to be the next subset of block identifiers in the inactive log. The next subset need not be based on the sequential order in which the subsets appear in the inactive log.

The block free unit can maintain a pointer for each subset that indicates the location in memory at which the respective subset resides. The block free unit can also maintain an offset for each subset that indicates which block identifier in the respective subset is the first block identifier. This data can be used to facilitate the operations depicted in FIG. 6. Additionally, the current subset can be indicated by setting a particular variable to the pointer for the particular subset. After the block free unit initializes or updates the current subset, control then flows to block 602.

At block 602, the block free unit reads the first block identifier from the current subset. The block free unit can read the first block identifier from the current subset by reading the memory location indicated by the pointer to the current subset (or a combination of the pointer and an offset). After the block free unit reads the first block identifier from the current subset, control then flows to block 604.

At block 604, the block free unit adds the block identifier to the binary heap. To add the block identifier to the binary heap, the block free unit inserts the block identifier as a leaf node. The block free unit then performs a “sift up” operation in which the node corresponding to the block identifier is swapped with its parent node while the parent node block identifier is greater. After the block free unit adds the block identifier to the binary heap, control then flows to block 606.
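
The insertion and “sift up” operation can be sketched with an array-backed min-heap, in which the parent of the node at index i is at index (i - 1) // 2. The array representation and the function name are assumptions; the text does not specify how the binary heap is stored.

```python
def heap_insert(heap: list, block_id: int) -> None:
    """Insert block_id as a leaf, then sift it up while its parent is greater."""
    heap.append(block_id)          # new leaf at the end of the array
    child = len(heap) - 1
    while child > 0:
        parent = (child - 1) // 2
        if heap[parent] <= heap[child]:
            break                  # min-heap property holds; stop sifting
        heap[parent], heap[child] = heap[child], heap[parent]
        child = parent
```

After each insertion the smallest block identifier is at index 0, which is the root node that the merge removes next.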

At block 606, the block free unit removes the block identifier from the inactive log. The block identifier can be removed from the inactive log explicitly or implicitly. To remove the block identifier from the inactive log explicitly, the block free unit can overwrite the block identifier with a default value, such as ‘0’ or ‘NULL’. To remove the block identifier from the inactive log implicitly, the block free unit can update the pointer to the current subset to reference the next block identifier in the current subset (or increment an offset indicating a current block identifier within the current subset). After the block free unit removes the block identifier from the inactive log, control then flows to block 608.

At block 608, the block free unit determines whether the current subset is the last subset of the inactive log. In other words, the block free unit determines whether all subsets of block identifiers have been iterated over. The technique used to determine whether all subsets of block identifiers have been iterated over can vary. For example, if the block free unit is iterating over the subsets linearly (as they appear in a linear representation of the inactive log), the block free unit can determine whether the current subset is the last subset in the inactive log. If the iteration is not based on the position of the subsets in the inactive log, the block free unit might reference metadata that tracks whether each particular subset has been iterated over. If the block free unit determines that the current subset is the last subset of the inactive log, control then flows to block 610. If the block free unit determines that the current subset is not the last subset of the inactive log, control then flows back to block 600.

At block 610, the binary heap initialization loop ends. At the end of the binary heap initialization loop, the binary heap includes the first block identifier from each subset of block identifiers in the inactive log. Because the subsets of block identifiers were previously sorted in ascending order, the binary heap contains the lowest block identifier in each of the subsets. Further, because the binary heap is a min-heap, the block identifier of each node is no greater than the block identifiers of its child nodes, so the block identifiers are in ascending order along any path from the root node to a leaf node. After the binary heap is initialized via the binary heap initialization loop, control then flows to block 612.

At block 612, the block free unit appends the block identifier corresponding to the root node to the sorted log. The properties of a min-heap result in the block identifier associated with the root node being the minimum block identifier of all block identifiers in the binary heap. Further, because the subsets of block identifiers are already sorted, the root node of the binary heap is the minimum block identifier of all block identifiers remaining in the subsets. After the block free unit appends the block identifier corresponding to the root node to the sorted log, control then flows to block 614.

At block 614, the block free unit replaces the root node with the last leaf node of the binary heap. The last leaf node is not necessarily the maximum block identifier in the binary heap, but, due to the properties of a min-heap, it is no less than the block identifier of the root node it replaces. Replacement of the root node by the leaf node results in the moving of the leaf node to the root position of the binary tree. After the block free unit replaces the root node with the last leaf node, control then flows to block 616.

At block 616, the block free unit sifts the new root node down in the binary heap. To sift the new root node down in the binary tree, the block free unit swaps the new root node with a child node that is less than the new root node. The block free unit continues to swap the new root node with child nodes until no child node is less than the new root node. After being sifted down, the new root node is no longer the root node, but merely part of the binary heap. After the block free unit sifts the new root node down in the binary heap, control then flows to block 618.
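The root removal and sift-down operations of blocks 612 through 616 can be sketched as follows. This is an illustration only, assuming a list-based min-heap in which the children of index i are at 2i + 1 and 2i + 2. The function name heap_pop_min is hypothetical. The sketch swaps with the smaller of the two children, which is one common way to satisfy the "child node that is less than the new root node" condition while preserving the heap property.

```python
def heap_pop_min(heap):
    """Remove and return the root of a non-empty list-based min-heap."""
    root = heap[0]
    last_leaf = heap.pop()            # detach the last leaf node
    if heap:
        heap[0] = last_leaf           # the last leaf becomes the new root
        parent, size = 0, len(heap)
        while True:
            left, right = 2 * parent + 1, 2 * parent + 2
            smallest = parent
            if left < size and heap[left] < heap[smallest]:
                smallest = left
            if right < size and heap[right] < heap[smallest]:
                smallest = right
            if smallest == parent:    # no child is less: sift-down is done
                break
            heap[parent], heap[smallest] = heap[smallest], heap[parent]
            parent = smallest
    return root


# Hypothetical heap contents; repeated removal yields ascending order.
heap = [3, 7, 19, 42]
drained = [heap_pop_min(heap) for _ in range(4)]
```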

At block 618, the block free unit determines whether the subset associated with the block identifier written to the sorted log at block 612 is empty. To put it another way, the block identifier written to the sorted log at block 612 came from one of the subsets of block identifiers in the inactive log. The block free unit determines whether the subset of block identifiers from which the block identifier came is empty. The particular subset from which the block identifier came can be identified by maintaining a pointer to the subset in the node with the block identifier. If the block free unit determines that the subset associated with the block identifier is not empty, control then flows to block 620. If the block free unit determines that the subset associated with the block identifier is empty, control then flows to block 626.

At block 620, the block free unit reads the first block identifier from the subset associated with the block identifier written to the sorted log at block 612. The operations the block free unit performs to read the first block identifier from the subset can be substantially similar to those described at block 602. The first block identifier of the subset is the first block identifier remaining in the subset. In other words, the original first block identifier of the subset was removed at block 606, at which point the next block identifier becomes the first block identifier. This process is continued at blocks 620 through 624. After the block free unit reads the first block identifier from the subset associated with the block identifier, control then flows to block 622.

At block 622, the block free unit adds the block identifier read at block 620 to the binary heap. The operations performed by the block free unit to add the block identifier to the binary heap can be substantially similar to those performed at block 604. After the block free unit adds the block identifier to the binary heap, control then flows to block 624.

At block 624, the block free unit removes the block identifier read at block 620 from the inactive log. The operations performed by the block free unit to remove the block identifier from the subset can be substantially similar to those performed at block 606. After the block free unit removes the block identifier from the subset, control then flows back to block 612.

Control flowed to block 626 if it was determined, at block 618, that the subset associated with the root node is empty. At block 626, the block free unit determines whether the inactive log is empty. The inactive log is empty when all block identifiers have been removed from the inactive log. The mechanism used to determine whether the inactive log is empty can vary. For example, the block free unit might track the count of block identifiers in the inactive log. Each time a block identifier is removed, the block free unit can decrement the count. When the count reaches zero, the inactive log is empty.

As another example, assume that a block identifier is removed by writing a default value to the corresponding entry in the inactive log. Once the last block identifier for a particular subset is removed, the pointer is updated to refer to the beginning of the next subset. The particular entry at the beginning of the next subset, however, was set to the default value after the binary heap was initialized. Thus, the block free unit can determine that the particular subset is empty by determining that the pointer currently references an entry that is set to the default value. Thus, to determine that the inactive log is empty, the block free unit determines whether the pointer to each subset of block identifiers references an entry that is set to the default value. If the block free unit determines that the inactive log is not empty, control then flows back to block 612. If the block free unit determines that the inactive log is empty, the process ends.
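The overall flow of blocks 600 through 626 amounts to a k-way merge of sorted subsets, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it uses the heapq module of the Python standard library in place of hand-written sift operations, represents each subset as a sorted list, performs implicit removal with a per-subset offset, and stores the index of the originating subset alongside each block identifier. The name merge_sorted_subsets is hypothetical. The loop ends when the heap is empty, which occurs exactly when every subset has been exhausted.

```python
import heapq


def merge_sorted_subsets(subsets):
    """Merge sorted lists of block identifiers into one sorted log."""
    offsets = [0] * len(subsets)      # next unread entry in each subset
    heap = []

    # Blocks 600-610: seed the heap with the first identifier of each subset.
    for index, subset in enumerate(subsets):
        if subset:
            heapq.heappush(heap, (subset[0], index))
            offsets[index] = 1        # implicit removal from the subset

    sorted_log = []
    while heap:
        # Blocks 612-616: append the minimum and restore the heap.
        block_id, index = heapq.heappop(heap)
        sorted_log.append(block_id)
        # Blocks 618-624: refill from the subset the identifier came from.
        if offsets[index] < len(subsets[index]):
            heapq.heappush(heap, (subsets[index][offsets[index]], index))
            offsets[index] += 1
    return sorted_log


# Hypothetical inactive log holding three sorted subsets.
sorted_log = merge_sorted_subsets([[2, 9, 30], [3, 16], [1, 40]])
```

Because the heap never holds more than one entry per subset, each step costs time logarithmic in the number of subsets rather than in the total number of block identifiers.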

Impact on Spatial Locality Example Illustration

FIG. 7 is a conceptual diagram illustrating the increased spatial locality facilitated by a sorted log. FIG. 7 depicts an insertion unit 702, an active map 704, and a memory 706. A first example 700A depicts the flow of data from the active map 704 to the memory 706 when an unsorted log 708 is used to free blocks. A second example 700B depicts the flow of data from the active map 704 to the memory 706 when a sorted log 710 is used to free blocks.

The active map 704 is depicted as a set of bits arranged into bytes (eight bits). In this example, the block size is a byte, meaning that data read from the active map 704 is read in bytes. Thus, to read the bit associated with block 3, the entire first byte 712 is read into the memory 706.

The first example 700A depicts the insertion unit 702 iterating over an unsorted log 708. The unsorted log 708 includes at least block identifiers 2, 16, and 3.

At stage A, the insertion unit 702 is currently iterating over the first block identifier, 2, of the unsorted log 708. When the insertion unit 702 iterates over a particular block identifier, the insertion unit 702 reads the bit associated with that particular block identifier into the memory 706. However, because the block size is a byte, the insertion unit 702 actually reads the entire byte that contains the particular bit. Thus, at stage A, byte 712 is read into the memory 706.

At stage B, the insertion unit 702 is currently iterating over the second block identifier, 16, of the unsorted log 708. The seventeenth bit of the active map is the first bit of the third byte 714 (assuming block identifiers start at zero). Thus, the insertion unit 702 reads the third byte 714 into the memory 706.

It is assumed that, between stages B and C, the insertion unit 702 has iterated over enough block identifiers that the first byte 712 is no longer in the memory 706.

At stage C, the insertion unit 702 is currently iterating over the nth block identifier, 3, of the unsorted log 708. The fourth bit of the active map is the fourth bit of the first byte 712. Thus, the insertion unit 702 reads the first byte 712 into the memory 706.

The second example 700B depicts the insertion unit 702 iterating over a sorted log 710. The sorted log 710 includes at least block identifiers 2, 3, and 16 sorted in ascending order. In this particular example, block identifiers 2, 3, and 16 are the first block identifiers in the sorted log 710.

At stage D, the insertion unit 702 is currently iterating over the first block identifier, 2, of the sorted log 710. As above at stage A, the insertion unit 702 reads the first byte 712 into the memory 706.

At stage E, the insertion unit 702 is currently iterating over the second block identifier, 3, of the sorted log 710. However, the insertion unit 702 read the first byte 712 into memory at stage D. Thus, the first byte 712 is already resident in the memory 706 and does not need to be read from the active map 704.

At stage F, the insertion unit 702 is currently iterating over the third block identifier, 16, of the sorted log 710. As above at stage B, the insertion unit 702 reads the third byte 714 into the memory 706.

The data flows depicted by the two examples 700A and 700B of FIG. 7 illustrate two particular characteristics of a sorted log. First, when the unsorted log 708 is used, it is possible that a single block is read multiple times due to the random appearance of block identifiers. In other words, even though the bits for block identifiers 2 and 3 are both in the first byte 712, the first byte 712 might be read into the memory 706 twice. However, when using the sorted log 710, the first byte 712 is only read into the memory 706 once. This follows from the fact that all block identifiers associated with a particular block will come before all block identifiers associated with the next block when using the sorted log 710.

Second, when the unsorted log 708 is used, the data from the active map 704 might be read randomly. When the active map 704 is stored on certain types of storage devices, such as a hard disk, random reads can result in a performance penalty. Thus, the sorted log 710 can allow the insertion unit 702 to take advantage of sequential reads, as indicated by the arrow 716.
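The difference between examples 700A and 700B can be approximated with a small simulation. The sketch below is an illustration under assumptions that go beyond the text: the memory is modeled as a least-recently-used cache that holds a fixed number of active map bytes, and a capacity of one byte stands in for the eviction assumed between stages B and C. The names count_map_reads and resident_bytes are hypothetical.

```python
from collections import OrderedDict

BITS_PER_BYTE = 8  # one active map byte covers eight block identifiers


def count_map_reads(log, resident_bytes=1):
    """Count the active map bytes read while iterating over a log."""
    memory = OrderedDict()            # byte index -> resident marker (LRU)
    reads = 0
    for block_id in log:
        byte_index = block_id // BITS_PER_BYTE
        if byte_index in memory:
            memory.move_to_end(byte_index)   # already resident: no read
            continue
        reads += 1                           # byte is read from the map
        memory[byte_index] = True
        if len(memory) > resident_bytes:
            memory.popitem(last=False)       # evict least recently used
    return reads


unsorted_reads = count_map_reads([2, 16, 3])   # example 700A
sorted_reads = count_map_reads([2, 3, 16])     # example 700B
```

With these inputs the unsorted log causes three reads (the first byte is read twice), while the sorted log causes two.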

The examples described above do not make a distinction between blocks associated with different storage objects. A storage object is, effectively, a collection of blocks of data. Examples of storage objects include volumes, files, directories, etc. Storage objects can also be a collection of other storage objects (e.g., a volume might be a collection of files). Many aspects of the operation of a storage system can be done on a per-storage-object basis. For example, access to data can be controlled on a per-storage-object basis and metadata can be maintained on a per-storage-object basis. In other words, storage objects can be treated as individual entities. Storage objects are, generally, logical constructs, meaning that storage objects may have no correspondence with physical entities. For example, while there might be a one-to-one relationship between volumes and storage devices in some configurations, a volume might comprise data on part of a first storage device and data on part of a second storage device.

Accordingly, the operations described herein can be adapted to work on a per-storage-object basis. For example, if a storage system is configured as a set of volumes, the storage system might maintain a set of block logs for each volume (e.g., an active log, inactive log, and sorted log for each volume). Similarly, there may be a separate instance of the block free unit for each volume. Additional operations might be performed to facilitate the per-storage-object functionality, such as routing particular block free indications to the appropriate block free unit associated with the volume that contains the block identified by the block free indication.

Further, the examples described above utilize multiple log files to facilitate the operations. However, a storage system can implement similar functionality utilizing fewer or more block logs. For example, consider a storage system that utilizes a single block log. Once a certain number of subsets have been sorted, a block free unit might designate the sorted subsets as subsets that should be merged. The block free unit can then merge the subsets in place, overwriting the existing data in the subsets, instead of merging the subsets into a separate block log. The block free unit can continue to append block identifiers to the block log while the subsets are being merged. Once the subsets are merged, the block free unit can iterate through the merged block identifiers and free them as described above. The block log can be further implemented as a circular buffer in which the block free unit begins inserting the block identifiers at the beginning of the block log after the block log reaches a certain size.

Resource Reservation Example Illustrations

Some of the operations described above are at least partially dependent on the completion of later operations. For example, in a configuration that uses a single sorted log, it might not be possible to generate a new sorted log from an inactive log until the blocks identified in an existing sorted log are freed. Further, in some configurations, if the inactive log is not empty, a log switch cannot be performed. If a log switch cannot be performed and the active log has reached a maximum size, no more block identifiers can be inserted into the active log. Thus, commands that result in freed blocks might be delayed until block identifiers can be inserted into the active log again.

To state it another way, the operations to free a block can be viewed as a pipeline. Delays along the pipeline or a large number of incoming block free indications can result in delaying responses to incoming operations. As described above, small, periodic delays can be better than a single, long delay in some instances. In other words, amortizing the cost of the metadata updates over a long period of time can be less noticeable to users and less likely to cause errors.

The possibility of a large delay can be reduced by, effectively, reserving resources based on an incoming workload. For example, for each received indication that a block should be freed, a block free unit might wait until a block is freed before accepting any more indications. Thus, instead of delaying a large number of commands at once, the block free unit ends up delaying a small number of commands over a longer period of time. By tying the incoming indications to the work performed to free the identified blocks, a block free unit can cause the delay in responding to incoming operations to increase gradually. Thus, the clients (or protocols employed by the clients) can react accordingly by decreasing the rate at which operations are issued to the storage system.

In order to reserve resources based on an incoming workload (“frontend work”), a block free unit records an indication that identifies the amount of work that should be performed to free blocks (“backend work”). The indication can be a count of the number of block free indications received in a particular time period. The various components of the block free unit, such as the free unit, can then determine the amount of backend work that should be performed to compensate for the frontend work received. The indication can be individual indications for each component of the block free unit. For example, for every n block free indications received, the block free unit might indicate that 1.10×n units of work should be performed by a particular component.
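The relationship between frontend work and backend work can be sketched as follows. This is an illustration only: the per-component factors, expressed as integer percentages to avoid floating-point rounding, are hypothetical values chosen to match the 1.10×n example, and the names WORK_PERCENT and backend_work_units do not appear in the disclosure. Fractional results are rounded up so that the backend never falls behind.

```python
# Hypothetical per-component factors, in percent of notifications received.
WORK_PERCENT = {"sort": 100, "merge": 110, "free": 110}


def backend_work_units(notification_count, work_percent=WORK_PERCENT):
    """Return the units of backend work owed by each component."""
    return {
        # Integer ceiling of notification_count * percent / 100.
        component: (notification_count * percent + 99) // 100
        for component, percent in work_percent.items()
    }


owed = backend_work_units(1000)
```

For 1000 notifications this yields 1000 units for the sort component and 1100 units each for the merge and free components.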

A block free unit might reserve resources in particular circumstances instead of all the time. For example, the block free unit might not reserve resources until the size of one or more of the block logs reaches a particular threshold. For example, if the size of the active log surpasses eighty percent of a maximum size, the block free unit might begin to reserve resources. Similarly, if the size of the one or more block logs falls below the threshold, the block free unit might stop reserving resources.

FIG. 8 depicts a block free unit with a resource reservation-based workload management unit. FIG. 8 depicts a subset of a storage system 800 including a block free unit 802. The block free unit 802 includes a resource reservation-based workload management unit (hereinafter “workload management unit”) 804, an insertion unit 806, a sort unit 808, a sorted merge unit 810, and a free unit 812. In this example illustration, notifications that blocks should be freed (hereinafter “notifications”) 814 are received by the workload management unit 804. The workload management unit 804 tracks the received notifications 814 and performs the management operations described below.

At stage A, the workload management unit 804 monitors various statistics and data related to the operation of the block free unit 802. In this particular example, the workload management unit 804 monitors the size of the block logs employed by the block free unit 802: an active log, an inactive log, and a sorted log (none depicted). The workload management unit 804 can monitor the size of the block logs individually or as a whole. The workload management unit 804 can also monitor various other aspects of the operation of the block free unit 802, such as the rate of incoming notifications 814 and the rate at which block metadata is updated to indicate that blocks are free. While depicted as an individual stage, the operations described at stage A are generally ongoing, meaning they can occur while operations at other stages are also being performed.

At stage B, the workload management unit 804 receives a set of n notifications 814. The n notifications 814 can be associated with specific blocks that are being freed, in which case the n notifications 814 can include the block identifiers of the specific blocks. The n notifications 814 might not be associated with specific blocks; instead, the n notifications 814 might just be general notifications that n blocks are going to be freed. Regardless of whether the n notifications 814 include block identifiers or not, each notification indicates that the block free unit 802 will be receiving a block identifier at some point.

At stage C, the workload management unit 804 allocates sufficient space to the active log to allow n block identifiers to be inserted into the active log. To allocate the space, the workload management unit 804 can indicate to a file system that additional data blocks should be allocated to the active log. Generally, the size of each block identifier is fixed. For example, each block identifier might be eight bytes in size. Every time the workload management unit 804 receives n notifications, the workload management unit 804 can indicate to the file system that a corresponding number of data blocks should be allocated to the active log. The particular number of notifications received prior to the workload management unit 804 requesting an allocation can vary. For example, consider a file system that is implemented using indirect blocks, which are metadata blocks that point to a plurality of actual data blocks. Instead of indicating that the file system should allocate a data block each time a notification is received, the workload management unit 804 might indicate that the file system should allocate an indirect block to the active log once the workload management unit 804 determines that enough block identifiers will be received to fill up the data blocks associated with the indirect block.
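The space calculation at stage C can be sketched as follows. The eight-byte block identifier size comes from the example above; the 4096-byte data block size and the 1024 data block pointers per indirect block are hypothetical values chosen only for illustration, as are the function names.

```python
BLOCK_ID_BYTES = 8            # from the example in the text
DATA_BLOCK_BYTES = 4096       # hypothetical data block size
POINTERS_PER_INDIRECT = 1024  # hypothetical fan-out of an indirect block


def data_blocks_to_reserve(notification_count):
    """Data blocks needed to hold one identifier per notification."""
    needed_bytes = notification_count * BLOCK_ID_BYTES
    return -(-needed_bytes // DATA_BLOCK_BYTES)   # integer ceiling


def fills_indirect_block(pending_identifiers):
    """True once enough identifiers are expected to fill every data
    block reachable from a single indirect block."""
    ids_per_data_block = DATA_BLOCK_BYTES // BLOCK_ID_BYTES
    return pending_identifiers >= POINTERS_PER_INDIRECT * ids_per_data_block


blocks_for_1000 = data_blocks_to_reserve(1000)
```

Under these assumptions, 1000 notifications require two data blocks, and an indirect block would be requested only after 524,288 identifiers are expected.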

When blocks are allocated, metadata associated with the blocks is generally stored in memory. Thus, allocating the space preemptively not only takes up space on a storage device, but also takes up memory on the controller. The workload management unit 804 therefore effectively reserves these resources for the incoming block identifiers, making them unavailable to other operations.

At stage D, the workload management unit 804 determines that receiving n block identifiers will result in one or more active log subsets reaching a size threshold and indicates that each of the one or more subsets is to be sorted. In other words, the workload management unit 804 determines, based on the n notifications 814, that the block identifiers associated with the n notifications 814 will cause at least one subset of the active log to reach a size threshold. The workload management unit 804 thus records an indication that one or more subsets will need to be sorted. The workload management unit 804 will typically record an indication of the specific number of subsets. For the example illustrations described herein, it will be assumed that the workload management unit 804 determines, based on the n notifications 814, that one subset will reach the threshold and that the workload management unit 804 indicates that one subset should be sorted.

At stage E, the workload management unit 804 determines that the size of one or more of the block logs has exceeded a threshold and enters “tight mode”. The threshold can be a static threshold, such as one gigabyte, or a dynamic threshold, such as a percentage of the amount of space available on one or more storage devices. The threshold can also be a count of the entries in the block logs or another metric that is indicative of the block free unit workload. The workload management unit 804 can determine the size of the block logs by querying a file system, storage device, etc. The workload management unit 804 might determine the sum of the sizes of the individual block logs to determine the size of the block logs. Alternatively, the workload management unit 804 might determine the size of the block logs as a whole by determining the size of a folder containing the block logs, for example. Tight mode might be entered when the workload management unit 804 determines that a single block log (such as the active log) has exceeded the threshold or when the aggregate size of a combination of logs has exceeded the threshold. Other mechanisms that can result in the workload management unit 804 entering tight mode are discussed below.

Once the workload management unit 804 determines that the block log size exceeds the threshold, the workload management unit 804 enters tight mode. In tight mode, the workload management unit 804 begins to more aggressively monitor the incoming notifications 814. In addition, the workload management unit 804 initiates resource reservation, as described below. The particular mode that the workload management unit 804 is in can be indicated by a status variable or similar mechanism.
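The mode selection at stage E can be sketched as follows. This is a minimal illustration, assuming a static byte threshold and a dictionary of log sizes; the enumeration Mode, the function select_mode, and the aggregate flag are hypothetical names. The flag switches between comparing the sum of the block log sizes and comparing the largest single block log against the threshold.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical status variable for the workload management unit."""
    NORMAL = "normal"
    TIGHT = "tight"


def select_mode(log_sizes, threshold_bytes, aggregate=True):
    """Return TIGHT when the measured block log size exceeds the threshold."""
    if aggregate:
        measured = sum(log_sizes.values())   # size of the logs as a whole
    else:
        measured = max(log_sizes.values())   # largest individual log
    return Mode.TIGHT if measured > threshold_bytes else Mode.NORMAL


# Hypothetical sizes, in bytes, of the three block logs.
sizes = {"active": 600, "inactive": 300, "sorted": 200}
mode = select_mode(sizes, threshold_bytes=1000)
```

Re-evaluating the function as the logs shrink returns the status to NORMAL, which matches the behavior of leaving resource reservation once the size falls below the threshold.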

In this particular example, the workload management unit 804 attempts to avoid a scenario in which the block free unit 802 cannot perform one or more operations that might lead to delaying the insertion of block identifiers into the active log. The size of the block logs provides a convenient mechanism to determine how near such an event may be. Consider, for example, that an increase in block log size can be caused by receiving a greater number of notifications 814 than the number of block identifiers removed from a sorted log. Similarly, a decrease in block log size can be caused by receiving a smaller number of notifications 814 than the number of block identifiers removed from the sorted log. The threshold can thus be established to allow the workload management unit 804 to begin to reserve resources prior to reaching a maximum block log size, effectively establishing a buffer prior to an event that can lead to potentially large delays.

The operations described as being performed at stages F and G are performed in response to entering tight mode and receiving the n notifications 814. While in tight mode, the operations performed at stages F and G might be performed periodically, such as each time n notifications are received, after specific intervals of time, etc. In this example, the trigger that causes the workload management unit 804 to perform the operations at stages F and G occurs after receiving the n notifications 814.

At stage F, the workload management unit 804 determines that the inactive log is being merged to generate the sorted log and indicates that at least n block identifiers should be merged into the sorted log. The specific number of block identifiers that the workload management unit 804 determines should be merged to generate the sorted log can vary. However, a typical goal is to ensure that a sufficient number of block identifiers are merged into the sorted log to allow a log swap to occur before the active log reaches a maximum size. In other words, the specific number of block identifiers should be sufficient to allow the inactive log to be cleared of block identifiers prior to the active log reaching the maximum size. The specific number of block identifiers can thus vary depending on how close the active log is to the maximum size, whether the notifications are being received at an increasing rate, etc. For example, the specific number of block identifiers might be n×k, where k is a value that will allow the inactive log to be emptied before the active log reaches the maximum size. The value k might be selected dynamically based on the aforementioned variables or statically configured based on design parameters, performance testing, etc. The workload management unit 804 can store the specific number of block identifiers that should be merged in memory.
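One possible way to select k dynamically is sketched below. The text does not specify a formula, so the rule used here is an assumption: k is the ratio, rounded up, of block identifiers remaining in the inactive log to free entries remaining in the active log, with a minimum of one. Under that rule, if the block free unit merges k identifiers for each identifier that arrives, the inactive log empties no later than the active log fills. The names merge_quota, inactive_remaining, and active_free_slots are hypothetical.

```python
def merge_quota(n, inactive_remaining, active_free_slots):
    """Number of block identifiers to merge after n notifications."""
    if active_free_slots <= 0:
        # Active log is already full: merge everything that remains.
        return inactive_remaining
    # k = ceil(inactive_remaining / active_free_slots), at least 1.
    k = max(1, -(-inactive_remaining // active_free_slots))
    # Never request more merges than identifiers left in the inactive log.
    return min(n * k, inactive_remaining)


quota = merge_quota(n=100, inactive_remaining=10_000, active_free_slots=2_500)
```

In this example k evaluates to 4, so 400 block identifiers are scheduled for merging in response to 100 notifications.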

At stage G, the workload management unit 804 determines that the sorted log is available for processing and indicates that at least n blocks should be freed. Similar to the goal described above at stage F, a typical goal of the workload management unit 804 is to ensure that the block identifiers in the sorted log are processed at a rate sufficient to keep a non-empty sorted log from preventing the inactive log from being merged into the sorted log. Thus, the particular number of blocks that should be freed can vary in a similar manner to the number of block identifiers indicated at stage F.

Stages H through K depict the operations that various components within the block free unit 802 perform when resources have been reserved. In particular, stages H through K depict how the operation of the insertion unit 806, sort unit 808, sorted merge unit 810, and the free unit 812 differ from the operations described above at FIG. 1, if at all, based on the operation of the workload management unit 804.

At stage H, the insertion unit 806 receives block identifiers associated with blocks that are to be freed. The insertion unit 806 appends the block identifiers to the active log as depicted above at stage B of FIG. 1. The operations performed by the insertion unit 806 typically do not need to be modified to take advantage of the workload management unit 804. This results from the fact that the workload management unit 804 allocates sufficient space in the active log for the incoming block identifiers. Thus, the workload management unit 804 effectively ensures that the insertion unit 806 can insert the block identifiers as they are received.

At stage I, the sort unit 808 determines the number of subsets to sort and sorts the subsets accordingly. The sort unit 808 performs operations substantially similar to those of the sort unit 112 of FIG. 1. However, the sort unit 808 determines the number of subsets (and potentially which subsets) to sort by querying the workload management unit 804, which determined that one or more subsets would reach the size threshold at stage D. The actual sorting of the subsets themselves can be performed in a substantially similar manner to that described above at stage D of FIG. 1. Once a subset is sorted, the sort unit 808 can notify the workload management unit 804 that the subset has been sorted, allowing the workload management unit 804 to update any appropriate data.

At stage J, the sorted merge unit 810 determines the number of block identifiers to merge into the sorted log and performs operations to merge the number of block identifiers into the sorted log. Thus, the sorted merge unit 810 queries the workload management unit 804 to determine the number of block identifiers to merge and then performs the operations described above at stage I of FIG. 1 until an equivalent number of block identifiers are merged into the sorted log. The sorted merge unit 810 can communicate the current status of the merge operations to the workload management unit 804, including indicating the number of block identifiers merged. The workload management unit 804 can update any appropriate data to reflect the progress.

At stage K, the free unit 812 determines the number of block identifiers to process from the sorted log and performs operations to process the number of block identifiers. Thus, the free unit 812 queries the workload management unit 804 to determine the number of block identifiers to process and then performs the operations described above at stages K, L, and M of FIG. 1 until an equivalent number of block identifiers from the sorted log are processed. The free unit 812 can communicate the current status of the free operations to the workload management unit 804, including indicating the number of block identifiers processed. The workload management unit 804 can update any appropriate data to reflect the progress.

While the sort unit 808, sorted merge unit 810, and the free unit 812 ultimately perform the same operations as those described above at FIG. 1, the particular priority at which the operations are performed varies. For example, when the workload management unit 804 indicates that there is no specific amount of work to be performed by one of the components of the block free unit 802, the particular component might perform a predetermined amount of work, perform the work for a specific period of time, etc. However, when the workload management unit 804 indicates that there is a specific amount of work to be performed, the particular component instead performs the specific amount of work. In other words, the particular component effectively operates at a higher priority, potentially delaying other operations until the specific amount of work is performed.

The operations described above in relation to FIG. 8 describe the individual components within the block free unit 802 (the sort unit 808, the sorted merge unit 810, and the free unit 812) as querying the workload management unit 804 for the amount of work that should be performed. In some storage systems, the workload management unit 804 pushes the amount of work to be performed to the individual components. For example, instead of the sort unit 808 querying the workload management unit 804 for the number of subsets that are to be sorted at stage I, the workload management unit 804 can notify, at stage D, the sort unit 808 of the number of subsets that are to be sorted. The sort unit 808 can track the number of subsets that are to be sorted by accumulating the numbers specified by the workload management unit 804 and decrementing the count as each subset is sorted.
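The push-based variation can be sketched as a simple counter kept by the sort unit. This is an illustration only; the class name SortWorkTracker and its method names are hypothetical, and the sketch omits the sorting itself.

```python
class SortWorkTracker:
    """Tracks how many subsets the sort unit still has to sort."""

    def __init__(self):
        self.pending_subsets = 0

    def notify(self, subsets_to_sort):
        # Called by the workload management unit: accumulate pushed work.
        self.pending_subsets += subsets_to_sort

    def subset_sorted(self):
        # Called by the sort unit after finishing one subset.
        if self.pending_subsets > 0:
            self.pending_subsets -= 1

    def has_reserved_work(self):
        return self.pending_subsets > 0


tracker = SortWorkTracker()
tracker.notify(1)      # stage D: one subset will reach the threshold
tracker.notify(2)      # a later batch of notifications adds two more
tracker.subset_sorted()
```

After these calls two subsets remain pending, so the sort unit would continue to operate on reserved work.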

The particular operations performed while in tight mode can vary. For example, FIG. 8 depicts the operations occurring at stages C and D as occurring regardless of whether the workload management unit 804 is in tight mode or not, while depicting the operations occurring at stages F and G as occurring while the workload management unit 804 is in tight mode. Storage systems can vary, however. In some storage systems, for example, all of the operations depicted at stages C, D, F, and G might occur regardless of whether the workload management unit 804 is operating in tight mode. Similarly, in some storage systems, all of the operations depicted at stages C, D, F, and G might occur only when the workload management unit 804 is in tight mode. Further, the particular operations performed to reserve resources can vary based on the particular storage system configuration. The techniques described above can thus be adapted to perform additional or fewer operations based on the particular storage system configuration.

FIG. 8 depicts the workload management unit 804 as a central component that performs the operations related to receiving the n block free notifications. However, in some storage systems, the functionality of the workload management unit 804 can be performed by other components. For example, the operations performed by the workload management unit 804 at stage D might be performed by the insertion unit 806 itself. Similarly, the operations related to the other components of the block free unit 802 might be performed by the components themselves. The operations can be adapted accordingly. For example, each component of the block free unit 802 might receive the n notifications.

The operations described above do not explicitly delay responses to incoming operations. However, by effectively “forcing” a particular amount of work to be done, the block free unit dedicates computing resources, such as processor cycles and memory, to the freeing of the blocks. Thus, fewer computing resources are available for processing incoming commands from clients. With fewer resources available for processing incoming commands, the commands can take longer to process and, consequently, to respond to. As more resources are consumed by the block freeing operations and/or the rate of incoming commands increases, the delay in responding to incoming commands increases as well.

Increasing the delay can result in a decrease in incoming commands in at least two particular ways. First, increasing the delay in responding to a command can slow the rate of dependent commands. A dependent command is a command that is only sent after receiving a response to a previous command. For example, if the same block is written to twice, the writes should be performed in sequential order. To preserve sequentiality, a client can send the first write command, wait until a response is received verifying that the first write command was completed, then send the second write command. If the response to the first write command is delayed, the second write command (and subsequent write commands) will also be delayed, thus decreasing the rate at which commands are sent to the storage system.

Second, many communications protocols define parameters indicating when clients should decrease the rate at which the clients issue commands. For example, if the response time of a storage system increases by a certain percentage, the communications protocol might specify that a client throttle the rate at which commands are sent to the storage system by a proportional amount.

Thus, by reserving resources for processing based on the incoming workload, a storage system can effectively cause clients to decrease the rate at which they send commands to the storage system. Because the operations can result in a gradual increase in response delay, there is a lower chance that fatal errors occur or the decrease in performance becomes noticeable to a user.

Example Operations for Reserving Resources for Freeing Blocks

FIG. 9 is a flowchart depicting example operations for reserving resources to free blocks. The example operations depicted in FIG. 9 can be performed by a block free unit, such as the block free unit 108 depicted in FIG. 1, or another component.

At block 900, a block free unit receives a notification that a block is to be freed. The notification can include a block identifier and can come from a storage system client, a storage system component, etc. After the block free unit receives the notification that the block is to be freed, control then flows to block 902.

At block 902, the block free unit allocates space to the active log for the associated block identifier. The block free unit can allocate the space to the active log by allocating a sufficient number of data blocks to the active log to hold a block identifier. If data blocks are larger than block identifiers, the block free unit might only allocate space to the active log if there is otherwise insufficient space to add a new block identifier to the active log. To allocate the space to the active log, the block free unit might perform operations in conjunction with a file system component (such as a file system manager). After the block free unit allocates space to the active log for the associated block identifier, control then flows to block 904.

At block 904, the block free unit determines whether the addition of a block identifier to the active log will cause a subset of the active log to reach a particular size threshold. For example, to determine whether the subset will reach the size threshold, the block free unit can track, or otherwise determine, the number of block identifiers currently in the subset and/or pending insertion into the subset (including the notification received at block 900) and compare the number of block identifiers to the size threshold. The size threshold might also be based on a quantity of data, such as bytes, instead of or in conjunction with a count of block identifiers. If the number of block identifiers is equal to the size threshold, the block free unit determines that the addition of the block identifier will cause the subset to reach the size threshold. If the block free unit determines that the addition of a block identifier to the active log will cause a subset of the active log to reach the particular size threshold, control then flows to block 906. If the block free unit determines that the addition of a block identifier to the active log will not cause a subset of the active log to reach the particular size threshold, control then flows to block 908.
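The determination at block 904 can be sketched as a simple comparison. The function and parameter names below are hypothetical, and the sketch uses a count-based threshold only. It compares with `>=` rather than strict equality, which is an assumption made so that a count that overshoots the threshold is still detected.

```python
def will_reach_threshold(ids_in_subset, ids_pending, size_threshold):
    """Return True if the subset will reach the size threshold.

    ids_in_subset:  block identifiers already stored in the subset
    ids_pending:    block identifiers pending insertion, including the one
                    from the notification currently being handled
    size_threshold: threshold expressed as a count of block identifiers
    """
    return ids_in_subset + ids_pending >= size_threshold


def next_block(ids_in_subset, ids_pending, size_threshold):
    # Control flows to block 906 when the threshold is reached and
    # directly to block 908 otherwise.
    if will_reach_threshold(ids_in_subset, ids_pending, size_threshold):
        return 906
    return 908
```

A byte-based threshold could be handled the same way by passing sizes in bytes instead of counts.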

At block 906, the block free unit indicates that a subset should be sorted. Typically, the block free unit maintains a count of the number of subsets that should be sorted. To indicate that a subset should be sorted, the block free unit can increment the count of the number of subsets that should be sorted. The block free unit might also maintain a data structure that indicates which specific subsets should be sorted. In such instances, the block free unit can add an indication of the specific subset that should be sorted to the data structure that indicates which specific subsets should be sorted. After the block free unit indicates that a subset should be sorted, control then flows to block 908.

Control flowed to block 908 if it was determined, at block 904, that the addition of a block identifier to the active log will not cause a subset of the active log to reach the particular size threshold. Control also flowed to block 908 from block 906. At block 908, the block free unit determines whether one or more conditions for tight mode are present. For example, the block free unit might determine whether the aggregate size of the block logs has reached a particular threshold, whether one or more individual block logs have reached a particular threshold, or whether a particular amount of time has elapsed. The block free unit can also check to see if a status variable indicates that one or more conditions for tight mode are present. For example, the block free unit might periodically analyze the size of the block logs. If the block free unit determines that the size of the block logs is greater than the particular threshold, the block free unit might set a variable to indicate that the block free unit is in tight mode. If the block free unit determines that the size of the block logs is not greater than the particular threshold, the block free unit might set the variable to indicate that the block free unit is not in tight mode. Thus, to determine whether the one or more conditions for tight mode are present, the block free unit might determine whether the variable is set to indicate that the block free unit is in tight mode. If the block free unit determines that one or more conditions for tight mode are present, control then flows to block 910. If the block free unit determines that one or more conditions for tight mode are not present, the process ends.
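The status variable approach described for block 908 might look like the following sketch. The class name `TightModeMonitor`, its thresholds, and the `analyze` method are hypothetical. Only the size-based conditions are modeled; the time-based condition is omitted because the text does not say from what point the elapsed time is measured.

```python
class TightModeMonitor:
    """Hypothetical periodic check that maintains a tight mode status variable."""

    def __init__(self, aggregate_threshold, per_log_threshold):
        self.aggregate_threshold = aggregate_threshold  # limit on total log size
        self.per_log_threshold = per_log_threshold      # limit on any single log
        self.tight_mode = False                         # the status variable

    def analyze(self, log_sizes):
        # Periodic analysis: set the variable when either the aggregate size
        # or any individual log size is greater than its threshold, and
        # clear it otherwise.
        self.tight_mode = (
            sum(log_sizes) > self.aggregate_threshold
            or any(size > self.per_log_threshold for size in log_sizes)
        )
        return self.tight_mode


monitor = TightModeMonitor(aggregate_threshold=100, per_log_threshold=60)
small = monitor.analyze([10, 20, 30])      # neither condition present
one_large = monitor.analyze([10, 70, 5])   # one individual log is too large
total_large = monitor.analyze([40, 40, 40])  # aggregate size is too large
```

At block 908, the block free unit would then only need to read `monitor.tight_mode` rather than recompute the log sizes for every notification.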

At block 910, the block free unit determines whether the sorted log is being generated. To determine whether the sorted log is being generated, the block free unit can determine whether the block free unit (or component therein) is in a state that corresponds to the generation of the sorted log (such as the “MERGE” state described above). If the block free unit determines that the sorted log is being generated, control then flows to block 912. If the block free unit determines that the sorted log is not being generated, control then flows to block 914.

At block 912, the block free unit indicates that at least one block identifier should be merged into the sorted log. Typically, the block free unit maintains a count of the number of block identifiers that should be merged into the sorted log. Thus, to indicate that at least one block identifier should be merged into the sorted log, the block free unit can increment the count of the number of block identifiers that should be merged into the sorted log. The block free unit might also indicate that more than one block identifier should be merged into the sorted log. For example, the block free unit might increment the count of the number of block identifiers that should be merged by two. After the block free unit indicates that at least one block identifier should be merged into the sorted log, the process ends.

Control flowed to block 914 if it was determined, at block 910, that the sorted log is not being generated. At block 914, the block free unit determines whether the block identifiers in the sorted log are being processed. To determine whether the block identifiers in the sorted log are being processed, the block free unit can determine whether the block free unit (or component therein) is in a state that corresponds to the block identifiers in the sorted log being processed (such as the “FREE” state described above). If the block free unit determines that the block identifiers in the sorted log are being processed, control then flows to block 916. If the block free unit determines that the block identifiers in the sorted log are not being processed, the process ends.

At block 916, the block free unit indicates that at least one block identifier in the sorted log should be processed. Typically, the block free unit maintains a count of the number of block identifiers in the sorted log that should be processed. Thus, to indicate that at least one block identifier in the sorted log should be processed, the block free unit can increment the count of the number of block identifiers in the sorted log that should be processed. The block free unit might also indicate that more than one block identifier in the sorted log should be processed. For example, the block free unit might increment the count of the number of block identifiers in the sorted log that should be processed by two. After the block free unit indicates that at least one block identifier in the sorted log should be processed, the process ends.
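The flow through blocks 900 to 916 can be summarized in a short sketch. Everything here is a simplification under stated assumptions: the class and field names are hypothetical, appending to a Python list stands in for both the space allocation at block 902 and the later insertion, the state name "INSERT" is invented for the sketch (only “MERGE” and “FREE” appear in the text), and the counters are incremented by one.

```python
from dataclasses import dataclass, field


@dataclass
class BlockFreeUnit:
    subset_size_threshold: int = 4
    state: str = "INSERT"    # hypothetical default; "MERGE" and "FREE" per the text
    tight_mode: bool = False  # status variable checked at block 908
    active_log: list = field(default_factory=list)
    subset_fill: int = 0      # identifiers in the subset currently being filled
    subsets_to_sort: int = 0  # count maintained at block 906
    ids_to_merge: int = 0     # count maintained at block 912
    ids_to_process: int = 0   # count maintained at block 916

    def on_free_notification(self, block_id):
        # Blocks 900 and 902: receive the notification and make room for the
        # identifier (a list append stands in for space allocation).
        self.active_log.append(block_id)
        self.subset_fill += 1
        # Block 904: does this addition bring the subset to the threshold?
        if self.subset_fill >= self.subset_size_threshold:
            # Block 906: indicate that one more subset should be sorted.
            self.subsets_to_sort += 1
            self.subset_fill = 0
        # Block 908: outside tight mode, nothing further is reserved.
        if not self.tight_mode:
            return
        # Block 910: is the sorted log being generated?
        if self.state == "MERGE":
            self.ids_to_merge += 1    # block 912
        # Block 914: are identifiers in the sorted log being processed?
        elif self.state == "FREE":
            self.ids_to_process += 1  # block 916


unit = BlockFreeUnit(subset_size_threshold=2)
unit.on_free_notification(7)
unit.on_free_notification(3)  # fills the first subset
unit.tight_mode = True
unit.state = "MERGE"
unit.on_free_notification(9)  # reserves one merge
unit.state = "FREE"
unit.on_free_notification(1)  # fills the second subset, reserves one free
```

The `if`/`elif` pair makes the operations at blocks 912 and 916 mutually exclusive, matching the flowchart as drawn.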

The example operations depicted in FIG. 9 are described as occurring in response to receiving, at block 900, a single notification that a block is to be freed. However, as described in relation to FIG. 8, the operations can be performed responsive to receiving multiple notifications. In other words, the operations depicted at blocks 902 through 916 might only be performed after receiving a specific number of notifications or after periodic time intervals instead of being performed after each notification is received.

Further, it is assumed that when one unit of frontend work is received the resources for at least one unit of backend work are reserved. For example, for each received notification that a block is to be freed, at least one block identifier is processed from the sorted log (if the sorted log has been generated). However, a block free unit can be configured to reserve less than one full unit of backend work per unit of frontend work. For example, when not in tight mode, the block free unit might reserve one unit of backend work for every two units of frontend work received. Relatedly, when extra resources are available, such as when processor utilization is low, additional units of backend work may be reserved, thus allowing the extra resources to be utilized. Utilizing the extra resources can allow the block free unit to effectively get ahead in the processing, reducing the impact of a sudden increase in block free indications.
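One way to reserve a fractional amount of backend work per unit of frontend work is to accumulate credit and release whole units as they become available. This is a sketch of one possible mechanism, not one described in the text; the class name `BackendReservation` and the credit scheme are assumptions.

```python
from fractions import Fraction


class BackendReservation:
    """Hypothetical accumulator mapping frontend units to backend units."""

    def __init__(self, ratio):
        # ratio: backend units reserved per frontend unit, e.g. Fraction(1, 2)
        self.ratio = Fraction(ratio)
        self._credit = Fraction(0)  # fractional backend work not yet reserved

    def on_frontend_unit(self):
        """Return the number of whole backend units to reserve now."""
        self._credit += self.ratio
        whole = self._credit.numerator // self._credit.denominator
        self._credit -= whole
        return whole


half = BackendReservation(Fraction(1, 2))  # one backend unit per two frontend units
half_units = [half.on_frontend_unit() for _ in range(4)]

double = BackendReservation(2)  # extra resources available: two per frontend unit
double_units = [double.on_frontend_unit() for _ in range(2)]
```

Using exact fractions avoids the drift that repeated floating-point additions could introduce into the credit.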

The specific resource reservation mechanisms can vary. For example, tight mode might be the only time in which the sorted log is generated and the metadata associated with block identifiers in the sorted log is updated. In other words, a sorted merge unit and a free unit (such as the sorted merge unit 810 and the free unit 812 depicted in FIG. 8) might only perform operations when in tight mode. In some storage systems, however, the operations to generate the sorted log and update the metadata might be performed even when not in tight mode. For example, when not in tight mode, a sorted merge unit and a free unit might perform operations based on the computing resources available. In other words, if there are computing resources that are not being utilized, the operations to generate the sorted log and update the metadata can be performed to utilize those computing resources. When tight mode is enabled, a sorted merge unit and a free unit might perform operations indicated by a workload management unit, regardless of the availability of computing resources. Thus, tight mode can be one of several modes of operation in which various operations are performed to account for current or anticipated conditions.

It should be noted that the descriptions above refer to “indications” that blocks are to be freed as well as “notifications” that blocks are to be freed. In practice, an indication that a block is to be freed and a notification that a block is to be freed can be the same thing. The descriptions herein, however, use the term “indication” when describing a communication or message that includes an identification of a block. The term “notification” is used when describing a communication or message that optionally includes an identification of a block. However, in practice, an “indication” can serve as a “notification” and vice versa. Consider, for example, the operations described in relation to FIG. 8. As described above, at stage B of FIG. 8 the workload management unit 804 receives “notifications” that blocks are to be freed. The subsequent operations performed by the workload management unit 804 can be performed without specific block identifiers. However, the workload management unit 804 can perform the same operations even if the notifications received identified specific blocks. In general, the terms “indication”, “notification”, “communication”, “message”, etc. describe similar concepts and should not be construed to refer to distinct concepts unless otherwise indicated.

The examples herein assume that a block log is a list of block identifiers. As such, appending a block identifier to a block log is functionally equivalent to inserting the block identifier at the end of the block log. However, in storage systems that implement the block log using a different format or data structure, the operation of inserting the block identifier into the block log may vary accordingly. For example, if the block log is implemented as a tree, the operations performed to insert a block identifier into the block log can comprise operations corresponding to inserting a node in a tree.
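The difference between the two representations can be sketched as follows. The unbalanced binary search tree, along with the names `Node`, `tree_insert`, and `in_order`, is a hypothetical choice made for illustration; the text does not specify which kind of tree a storage system might use.

```python
class Node:
    """Hypothetical tree node holding one block identifier."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.left = None
        self.right = None


def tree_insert(root, block_id):
    """Insert block_id into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(block_id)
    node = root
    while True:
        if block_id < node.block_id:
            if node.left is None:
                node.left = Node(block_id)
                return root
            node = node.left
        else:
            # Equal identifiers are placed in the right subtree.
            if node.right is None:
                node.right = Node(block_id)
                return root
            node = node.right


def in_order(root):
    """Return the block identifiers in ascending order."""
    if root is None:
        return []
    return in_order(root.left) + [root.block_id] + in_order(root.right)


list_log = []
tree_log = None
for block_id in [42, 7, 19]:
    list_log.append(block_id)                # list: insert at the end of the log
    tree_log = tree_insert(tree_log, block_id)  # tree: insert a node
```

With the list, the identifiers stay in arrival order and are sorted later; with the tree, an in-order traversal already yields them sorted.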

As example flowcharts, FIGS. 4, 5, 6, and 9 present operations in an example order from which storage systems can deviate (e.g., operations can be performed in a different order than illustrated and/or in parallel; additional or fewer operations can be performed, etc.). For example, FIG. 9 depicts the operations at blocks 912 and 916 as being mutually exclusive. However, in a storage system in which block identifiers can be merged into a sorted log while block identifiers in the sorted log are being processed, the operations performed at blocks 912 and 916 might not be mutually exclusive.

As will be appreciated by one skilled in the art, aspects of the disclosures herein may be embodied as a system, method or computer program product. Accordingly, aspects of the disclosures herein may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, resident software, micro-code, etc.) or an implementation combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the disclosures herein may take the form of a program product embodied in one or more machine readable medium(s) having machine readable program code embodied thereon.

Any combination of one or more machine readable medium(s) may be utilized. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable storage medium may be, for example, a system, apparatus, or device that uses electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology, or a combination thereof. More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine readable storage medium does not include transitory, propagating signals.

A machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the disclosures herein may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as the Perl programming language or the PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and/or accepting input on another machine. Examples of a machine that would execute/interpret/translate program code include a computer, a tablet, a smartphone, a wearable computer, a robot, a biological computing device, etc.

FIG. 10 depicts an example computer system with a block free unit. A computer system includes a processor 1001 (possibly including multiple processors, multiple cores, multiple nodes, and/or implementing multi-threading, etc.). The computer system includes memory 1007. The memory 1007 may be system memory (e.g., one or more of cache, SRAM, DRAM, zero capacitor RAM, Twin Transistor RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM, etc.) or any one or more of the above already described possible realizations of machine-readable media. The computer system also includes a bus 1003 (e.g., PCI, ISA, PCI-Express, HyperTransport®, InfiniBand®, NuBus, etc.), a network interface 1005 (e.g., an ATM interface, an Ethernet interface, a Frame Relay interface, SONET interface, wireless interface, etc.), and a storage device(s) 1009 (e.g., optical storage, magnetic storage, etc.). The block free unit 1011 embodies functionality to implement features described above. The block free unit 1011 may perform operations that facilitate increasing the efficiency of metadata updates related to freeing blocks of data. The block free unit 1011 may perform operations that facilitate logging indications of blocks that should be freed, sorting the indications to increase spatial locality of the indications, and updating metadata associated with the blocks. Any one of these functionalities may be partially (or entirely) implemented in hardware and/or on the processor 1001. For example, the functionality may be implemented with an application specific integrated circuit, in logic implemented in the processor 1001, in a co-processor on a peripheral device or card, etc. Further, realizations may include fewer or additional components not illustrated in FIG. 10 (e.g., video cards, audio cards, additional network interfaces, peripheral devices, etc.). The processor 1001, the storage device(s) 1009, and the network interface 1005 are coupled to the bus 1003. Although illustrated as being coupled to the bus 1003, the memory 1007 may be coupled to the processor 1001.

While the examples are described with reference to various implementations and exploitations, it will be understood that these examples are illustrative and that the scope of the disclosures herein is not limited to them. In general, techniques for freeing blocks of data as described herein may be implemented with facilities consistent with any hardware system or hardware systems. Many variations, modifications, additions, and improvements are possible.

Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the disclosures herein. In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosures herein.

Use of the phrase “at least one of . . . or” should not be construed to be exclusive. For instance, the phrase “X comprises at least one of A, B, or C” does not mean that X comprises only one of {A, B, C}; it does not mean that X comprises only one instance of each of {A, B, C}, even if any one of {A, B, C} is a category or sub-category; and it does not mean that an additional element cannot be added to the non-exclusive set (i.e., X can comprise {A, B, Z}).

What is claimed is:
 1. A method comprising: receiving, at a storage server, one or more notifications in an incoming workload that one or more data blocks are to be freed, each of the one or more data blocks including an associated block identifier indicating the location of each of the one or more data blocks on a storage device, the storage server maintaining a set of data block logs in a memory thereof, the set of data block logs including an active log, an inactive log, and a sorted log; in response to receiving the one or more notifications that the one or more blocks are to be freed, allocating each of the block identifiers associated with the one or more data blocks to the active log, the one or more data blocks corresponding to an amount of memory sufficient to store, within the active log, each of the block identifiers associated with the one or more data blocks to be freed; upon determining that adding a further block identifier to the active log will cause a first subset of the block identifiers of the active log to reach a subset size threshold, sorting the first subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause a second subset of the block identifiers of the active log to reach a subset size threshold, sorting the second subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause the active log to reach an active log size threshold, switching the active log to designation as the inactive log; and generating the sorted log by merging at least the first and second, sorted subsets of the block identifiers of the inactive log using a modified heapsort, the sorted log including the freed data blocks; wherein the one or more notifications that the one or more blocks are to be freed is received from one of a client computing device and a storage system component communicatively coupled to the storage server; and wherein after when the notifications are received, further monitoring the sizes of the active, inactive and sorted logs.
 2. The method of claim 1, further comprising: in response to determining that adding the further block identifier will cause the first subset of the block identifiers of the active log to reach the size threshold, incrementing a first count, the first count indicating a number of subsets of the active log for switching to the inactive log.
 3. The method of claim 1, further comprising determining that one or more conditions for a tight mode are present, wherein one or more of the active, inactive and sorted logs exceeds one of a static and a dynamically determined size threshold.
 4. The method of claim 3, wherein said determining that one or more conditions for the tight mode are present comprises: determining that an amount of time greater than a predetermined time threshold has elapsed.
 5. The method of claim 1 further comprising: in response to receiving the one or more notifications that the one or more data blocks are to be freed, incrementing a second count, wherein the second count indicates a number of data blocks that are to be freed.
 6. The method of claim 1 further comprising: updating metadata associated with the block identifiers of the freed data blocks.
 7. A non-transitory machine readable medium having stored thereon instructions for performing a method comprising machine executable code which when executed by at least one machine, causes the machine to: receive, at a storage server, one or more notifications in an incoming workload that one or more data blocks are to be freed, each of the one or more data blocks including an associated block identifier indicating the location of each of the one or more data blocks on a storage device, the storage server maintaining a set of data block logs in a memory thereof, the set of data block logs including an active log, an inactive log, and a sorted log; in response to receiving the one or more notifications that the one or more blocks are to be freed, allocate each of the block identifiers associated with the one or more data blocks to the active log, the one or more data blocks corresponding to an amount of memory sufficient to store, within the active log, each of the block identifiers associated with the one or more data blocks to be freed; upon determining that adding a further block identifier to the active log will cause a first subset of the block identifiers of the active log to reach a subset size threshold, sort the first subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause a second subset of the block identifiers of the active log to reach a subset size threshold, sort the second subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause the active log to reach an active log size threshold, switch the active log to designation as the inactive log; and generate the sorted log by merging at least the first and second, sorted subsets of the block identifiers of the inactive log using a modified heapsort, the sorted log including the freed data blocks; wherein the one or more notifications that the one or more blocks are to be freed is received from one of a client computing device and a storage system component communicatively coupled to the storage server; and wherein after when the notifications are received, further monitor the sizes of the active, inactive and sorted logs.
 8. The non-transitory storage medium of claim 7, wherein the machine executable code further causes the machine to: in response to determining that adding the further block identifier will cause the first subset of the block identifiers of the active log to reach the size threshold, increment a first count, the first count indicating a number of subsets of the active log for switching to the inactive log.
 9. The non-transitory storage medium of claim 7, wherein the machine executable code further causes the machine to: determine that one or more conditions for a tight mode are present, wherein one or more of the active, inactive and sorted logs exceeds one of a static and a dynamically determined size threshold.
 10. The non-transitory storage medium of claim 9, wherein to determine the one or more conditions for the tight mode are present includes determining that an amount of time greater than a predetermined time threshold has elapsed.
 11. The non-transitory storage medium of claim 7, wherein the machine executable code further causes the machine to: in response to receiving the one or more notifications that the one or more data blocks are to be freed, increment a second count, wherein the second count indicates a number of data blocks that are to be freed.
 12. The non-transitory storage medium of claim 7, wherein the machine executable code further causes the machine to: update metadata associated with the block identifiers of the freed data blocks.
 13. A system, comprising: a memory containing machine readable medium comprising machine executable code having stored thereon instructions for performing a method; and a processor coupled to the memory, the processor configured to execute the machine executable code to cause the processor to: receive, at a storage server, one or more notifications in an incoming workload that one or more data blocks are to be freed, each of the one or more data blocks including an associated block identifier indicating the location of each of the one or more data blocks on a storage device, the storage server maintaining a set of data block logs in a memory thereof, the set of data block logs including an active log, an inactive log, and a sorted log; in response to receiving the one or more notifications that the one or more blocks are to be freed, allocate each of the block identifiers associated with the one or more data blocks to the active log, the one or more data blocks corresponding to an amount of memory sufficient to store, within the active log, each of the block identifiers associated with the one or more data blocks to be freed; upon determining that adding a further block identifier to the active log will cause a first subset of the block identifiers of the active log to reach a subset size threshold, sort the first subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause a second subset of the block identifiers of the active log to reach a subset size threshold, sort the second subset of the block identifiers of the active log; upon determining that adding a further block identifier to the active log will cause the active log to reach an active log size threshold, switch the active log to designation as the inactive log; and generate the sorted log by merging at least the first and second, sorted subsets of the block identifiers of the inactive log using a modified heapsort, the sorted log including the freed data blocks; wherein the one or more notifications that the one or more blocks are to be freed is received from one of a client computing device and a storage system component communicatively coupled to the storage server; and wherein after when the notifications are received, further monitor the sizes of the active, inactive and sorted logs.
 14. The system of claim 13, wherein the machine executable code further causes the processor to: in response to determining that adding the further block identifier will cause the first subset of the block identifiers of the active log to reach the size threshold, increment a first count, the first count indicating a number of subsets of the active log for switching to the inactive log.
 15. The system of claim 13, wherein the machine executable code further causes the processor to: determine that one or more conditions for a tight mode are present, wherein one or more of the active, inactive and sorted logs exceeds one of a static and a dynamically determined size threshold.
 16. The system of claim 15, wherein to determine the one or more conditions for the tight mode are present includes determining that an amount of time greater than a predetermined time threshold has elapsed.
 17. The system of claim 13, wherein the machine executable code further causes the processor to: in response to receiving the one or more notifications that the one or more data blocks are to be freed, increment a second count, wherein the second count indicates a number of data blocks that are to be freed.
 18. The system of claim 13, wherein the machine executable code further causes the processor to: update metadata associated with the block identifiers of the freed data blocks.